Eric Jang – Building AlphaGo from scratch - podcast episode cover

Eric Jang – Building AlphaGo from scratch

May 15, 20262 hr 37 min
--:--
--:--
Download Metacast podcast app
Listen to this episode in Metacast mobile app
Don't just listen to podcasts. Learn from them with transcripts, summaries, and chapters for every episode. Skim, search, and bookmark insights. Learn more

Summary

Eric Jang discusses re-implementing AlphaGo with modern AI tools, highlighting Go as a clear example of intelligence primitives like search and self-play. He contrasts AlphaGo's efficient MCTS with the credit assignment problem in LLM-based reinforcement learning, explaining why MCTS is more sample-efficient. The episode also explores how LLMs can automate aspects of AI research, such as hyperparameter tuning and experiment execution, while identifying current limitations in problem selection and escaping dead ends.

Episode description

Eric Jang walks through how to build AlphaGo from scratch, but with modern AI tools.

Sometimes you understand the future better by stepping backward. AlphaGo is still the cleanest worked example of the primitives of intelligence: search, learning from experience, and self-play. You have to go back to 2017 to get insight into how the more general AIs of the future might learn.

Once he explained how AlphaGo works, it gave us the context to have a discussion about how RL works in LLMs and how it could work better – naive policy gradient RL has to figure out which of the 100k+ tokens in your trajectory actually got you the right answer, while AlphaGo’s MCTS suggests a strictly better action every single move, giving you a training target that sidesteps the credit assignment problem. The way humans learn is surely closer to the second.

Eric also kickstarted an Autoresearch loop on his project. And it was very interesting to discuss which parts of AI research LLMs can already automate pretty well (implementing and running experiments, optimizing hyperparameters) and which they still struggle with (choosing the right question to investigate next, escaping research dead ends). Informative to all the recent discussion about when we should expect an intelligence explosion, and what it would look like from the inside.

Watch on YouTube. Read the transcript.

And check out the flashcards I wrote to retain the insights.

Sponsors

* Cursor‘s agent SDK let me build a pipeline to generate flashcards for this episode. For each card, I had an agent read the transcript, ingest blackboard screenshots, generate an SVG visual, and run everything through a critic. A durable agent is much better at this kind of work than a chain of LLM calls, and Cursor’s SDK made it easy. Check out the cards at flashcards.dwarkesh.com and get started with the SDK at cursor.com/dwarkesh

* Jane Street gave me a real deep-dive tour of one of their datacenters. I got to ask a bunch of questions to Ron Minsky, who co-leads Jane Street’s tech group, and Dan Pontecorvo, who runs Jane Street’s physical engineering team. They were willing to literally pull up the floorboards and take out racks to explain how everything works. Check out the full tour at janestreet.com/dwarkesh

Timestamps

(00:00:00) – Basics of Go

(00:08:17) – Monte Carlo Tree Search

(00:32:04) – What the neural network does

(01:00:33) – Self-play

(01:25:38) – Alternative RL approaches

(01:45:47) – Why doesn't MCTS work for LLMs

(02:01:09) – Off-policy training

(02:12:02) – RL is even more information inefficient than you thought

(02:22:16) – Automated AI researchers



Get full access to Dwarkesh Podcast at www.dwarkesh.com/subscribe

Transcript

Basics of Go

C

Today I'm here with Eric Jang, who was most recently Vice President of AI at OneX Technologies, before that senior research scientist at what is now Google DeepMind Robotics. And you've been on sabbatical for the last few months. One of the things you've been doing is rebuilding and improving and hacking on AlphaGo. And so we're today what we're going to do is you're going to explain building AlphaGo from scratch and what it tells us about the future of AI research and development.

But uh before we get to that, why is AlphaGo interesting? Why is this why is this the project you decided to do on sabbatical rather than just hanging out at the beach?

B

Sure, yeah. Um I like making things and AlphaGo and Go AI is one of those things that really got me into the field. Uh when I saw the kind of early breakthroughs um on Alpha AlphaGo in twenty fourteen, twenty fifteen, twenty sixteen and so forth, it was just profound to see, you know, how smart AI systems could become and the

the kind of computational complexity class that they could tackle with deep learning. Um this is a problem that has, you know, long been understood to be kind of intractable for search, and yet um it was solved um through through deep learning. And so so that was quite mysterious to me. And I've always wanted to understand that phenomena a little bit better.

My training is often in deep neural nets for robotics, where it's uh the the decisions made by the neural networks are a bit more intuitive, but AlphaGo is a sort of problem where the the decisions are actually the result of a very, very deep search. And it's always been very mysterious to me how like a 10-layer network can sort of amortize the simulation of something so so uh so deep in the in the game tree.

C

No, interesting.

B

So if you plot out how much compute it took to build various iterations of strong GoBots over the years, you can see that in 2020 there was a open source project called Katago by David Wu from Jane Street who who basically achieved a forty X reduction in compute needed to train a really strong GoBot Table Raza.

I'm not certain if it's stronger than Alpha Go zero or AlphaZero Mu Zero, uh but it's very, very strong. And this is what most Go practitioners today train against when they're when they're playing an AI. And thanks to LM coding what took a whole team of research scientists at D mined and you know, millions of dollars of research and compute can now be done for, you know, a few thousand dollars of of rented compute.

C

By the way, if you're listening to this on an audio platform, this is a Blackboard lecture, so I highly recommend switching over to a video platform like YouTube if you can to look at the math and the graphs and the go board. Okay. I guess we should first discuss how Go works. Great. So yeah, how how does the the game work?

B

So the the Game of Go is a very simple one that can be implemented quickly and easily in a computer. The objective of the game is basically to put down black and white stones and try to occupy as much territory in the game as possible. So I might start by putting down a black stone. Uh black always goes first. So go ahead. And so the way you capture an opponent's stones is that for every um intersection, if you can surround all four of its neighbors with um with your stones.

then um then this one is sort of cut off from oxygen, if you will, and then it uh and it it i it it is it's it is a dead dead stone. So so then now I control these four stones as well as this empty intersection here. So there's like slight variations between Chinese, Japanese, and uh what is called Trump Taylor rules. Um Trump Taylor rules are designed to be completely unambiguous for Go, so this is what all Go AIs train against and resolve against.

So in typical Go, like where the humans play, you're actually not allowed to put this white stone down here. It would be instant suicide. Um in Trump Taylor, it's actually fine. You put it down and then it immediately resolves to death. So the outcome is sort of the same. Let's go ahead and start over and play play a few stones and then I'll explain some more. So I'll just start there.

C

I'm like basically playing randomly here, but I'm trying to get around your stones and see if I can

🔇 Silence

B

So this move um basically exposes one empty neighbor for your whitestone, and it's very akin to a check in chess, where if you don't respond immediately by putting anyone here, then I can immediately capture it.

C

I see okay.'Cause it is it is sort of the diagonals that determine whether you want it

B

The cross section, not the diagonals. So so this one is surrounded on three sides. Yeah. And so um you're at threat of losing that stone if you don't play one immediately there. Now you can see that I'm starting to pressure you because by putting a stone here, now you are forced to put one here.

C

Otherwise you would have this two block.

B

Aaron Powell And then if you think think through like what happens if you were to respond here, you can probably you know search into the future and deduce what I'll do in response uh once you once you

C

Yeah. I'm guessing you'd put the black here.

B

That's right. And then I would capture all three of these stones.

C

I should just assume that this is gone. This little block is gone. Yes.

B

So in Go, it's actually okay to let an opponent capture um some stones if, for example, it allows you to position uh to capture more stones in somewhere else on the board. And and this is what makes Go a very beautiful game, is that um you can kind of Uh lose the battle but win the war. Right. And and as the board size increases, the complexity of these kind of like micro versus macro dynamics uh gets gets more.

C

But presumably you'd put one here. Mm-hmm.

B

Yeah. And so now I would capture this entire group. Okay. And this this would be mine. Okay. There's one more um uh case that I wanna demonstrate, which uh actually I had a bug in my code uh recently. Which is uh the following situation. So let's consider a formation like this, right? And then you know we have other pieces on the board in play or whatever. Um

A

And so

B

Um let's talk a little bit about how the game ends. In this territory, who controls these these areas? Is it white or is it black? It's actually black because I have actually surrounded this whole area. Yeah. And it's um very uh assuming I have like other black stones here, it's actually very hard for you to break this out of the control of

C

Yeah. So i when the final score is tallied would would these ones also count as being in um

B

Yeah, great question. So so um this is where different rule sets have different ways of scoring, and so we should talk a little bit about how like uh you resolve scores between humans and how you resolve scores between uh computer code. Uh because there's actually some ambiguity in how humans evalu evaluate this.

So most humans would look at this board configuration and conclude that like black has kind of s totally surrounded white and so white has no chance of life. We could play out more here, but then at the end I would capture everything. Um however if you have a way of breaking this formation and connecting white to something outside of it, then it can flip. And so this is where it's you know a little bit Hard for a computer to decide these kind of things, right? So how do humans do it? Um

Humans basically say, uh, I think the game is done. And then you th you have to also say, I think the game is done, and then we'll say, like, I think these are these are milestones, and then you have to agree. If you don't agree, then we keep playing. So um essentially once two humans, their uh so-called value function um agree on a consensus, then the then the Chinese rules uh resolve that.

So in Tromp Taylor scoring, um it's perfectly unambiguous, so it can be decided, you know, algorithmically by a computer. So if if let's say you you have this at the end end game. The way you score this is that you first count how many stones you control, and that's unambiguous, then you count how many empty intersections that are not touched by your opponent's stones.

So these intersections would not count for either player because both of all of these intersections are connected to both white stones and blackstones. If um this were like this, then white would get three points. Now this is uh a little uh odd because uh human would know that white is actually losing these points. But um Trump Taylor's scoring would consider White to have all of these points as well as these points.

Got it, okay. So that is a very big difference in um how computer go scores things and how humans score things.

C

This is a game, man.

B

The game ends when either a player chooses to resign or both players pass consecutively. Cool. Yep. So that's the rules.

C

All right. Now help me crack this with AI. Great.

Monte Carlo Tree Search

E

Absolutely.

C

Let's understand how um AlphaGo actually works and how somebody in the audience might be able to implement it.

B

Great, yeah. Let's start with um kind of an intuition about the underlying um you know search process used to make moves. And we'll layer on uh ideas from deep learning to make it much more efficient and tractable. So Go is a game where there's just two players. We're gonna draw a person here, and we're gonna draw an AI here. And um let's say this person is playing black, so they go first. So we're gonna draw.

🔇 Silence

B

And then now the AI is going to make a move based on what it sees here. So there's a question of like how you encode these inputs into the AI. Maybe you could use ones and zeros, but you want to represent um you know black, white, and empty. So so yeah, you would need at least three different values here. So maybe you could use zero, ones, and twos or something. So the AI might see something like, you know, zero, zero, zero. Ah, one.

🔇 Silence

B

So this is the input to the AI on its turn. So the AI can choose, let's just pick three possible random moves that I can go, and I just drew these at random. And so which which move is best here, right? Well, we don't know until the game ends. Um there's no Go does not have any kind of local reward of which move here is good. And this is what makes Go a very difficult game, is that you don't actually know who won until you really get to the end of the game.

So how deep is this tree, right? Well, in a 19 by 19 um go board, there are uh you know roughly to the order of 361 moves on any given uh move. And of course as it fills up you have less moves. Um and And the the number of steps in the game can be somewhere from 250 to 300 moves. And maybe experts might uh decide to end the game well before that. But uh you know under Trump Taylor's scoring, you actually have to play things all the way to the end. So this could be like 300 moves or something.

Right. So like 300 um like depth of the tree. Yeah. So if we keep on expanding possible moves here, so th in in this move the AI is going, and then you know here the human would go. And then, you know, there's there's some

🔇 Silence

B

And so forth. You can find that like essentially what you end up with is an enormous explosion in the possible game outcomes originating from just this one state. So this is something to the order of like, you know, 361 to 300, power of 300, which is far more than the number of atoms in the universe.

Right. Like it's it's just uh it's just and of course actually there are redundancies and symmetries, so it's not actually 300, but but that's sort of the if you were to do a naive tree where there were no merging of children, then actually you end up with a tree about this big. Right. Let me uh use this board here. So if we start here and then you play here, and then um I play here and then you play here, that is equivalent to

I start here, you play here, I play here. Yeah. And then you play here. So so both of them arrived at the same spot but through different paths. So this child node can be thought about as a shared ancestor. Yeah.

C

And I guess it's not three sixty it starts at three sixty one but it dec decreases by one each time.

B

And the branching factor decreases by one each time. But in in any case, this is a very, very, very large tree. And uh this is also why, you know, computer scientists for many years thought that Go was not a tractable problem this century, because the amount of compute you would need to exhaustively search every possible possibility. Um is just too large. Um if you could, Go is actually a deterministic game. So um on any given state, you can actually compute what the

uh best possible strategy you can uh you can make is uh to in order to win the game. You can search all the possible futures where you win and then just make sure you always stay in in in that you know set of futures. So AlphaGo's kind of core conceptual breakthrough was using neural nets to make this search problem tractable.

So before we get into how neural networks are involved, let's talk a little bit about how we can, you know, assuming we had a very powerful enough computer, um, search this uh this tree to find the best move. Right. So in the beginning, um you're not going to build out the whole tree.

Uh b because storing that tree would be very expensive. Instead you might do something like interactively figure out which um which leaves of this tree are worthy of exploring and expanding into the future to see you know what else is there. So um there are some early algorithms in uh bandit literature like you know UCB.

uh one, which is not exactly appropriate for a you know sequential game like Go, but very much inspired the action selection um algorithm used in uh in AlphaGo. So so UCB1 looks like on on every move, we're gonna take the best action, uh or you know, the argmax over A that maximizes um um you know the Q of A, and I'll explain what Q of A is in a moment, plus some sort of exploration bonus.

🔇 Silence

B

So on every node, we're going to track a few quantities. So so let's you know consider each of these a node. This is this is the the root node, um where you're making decisions from. And these are the children of the root node. And um we're gonna say each node is basically a data structure. That is um it stores a visit count. of this um this note, this this child note.

C

How often the parent visited this node?

B

And we'll call this an action. So so one thing that is easy to trip on and is like if you come from uh you know robotics or um Other kinds of reinforcement learning is like, where are the actions, right? I'm only talking about nodes. Um nodes here represent states, and because this is a perfectly deterministic game with no randomness. You can actually just infer the action based on the child. So if I go here, that implies an action. And this is the state that we resolve.

The LLMs, if you ask to uh you know vibe code a uh MCTS implementation, it'll most likely design the right data structure here. But um you you know you it's up it's sort of uh chef's choice. You can actually rewrite the the tree structure however you like. This was what um Claude 4.6 wrote for me when I when I asked it, and it was a very reasonable choice. So um so then you know Q represents the um mean. Action. value of this action.

And I'll use a subscript A to denote that this kind of corresponds to taking a specific action to get here, right, from from the from the root node. Um so so like Uh if we if we have root, basically taking a gets us to this this node here. Um and then we're gonna also store the probability of taking this action.

C

I am from the past.

B

From the parent, yes. Like like what are the odds that we sample this one? Yeah. And and this will become relevant later, you know. Like uh we've we've talked about a deterministic tree for now, so I'll I'll bring probabilities into this later. And then finally we have a sort of uh dictionary of children. which is just like uh, you know, more of these notes, you know, in a sort of classic linked list style reference tree. So um this is the basic data structure to implement a tree.

And um in AlphaGo, um they use a slightly different action selection criteria called um pucked, and it's short for predicted upper confidence with trees. And uh this is basically um when you s when you select which which child to take, you do argmax a of Q of S A Plus. Constantly.

🔇 Silence

B

So the equation and forms are actually pretty similar. These are both scoring criteria, right? Like you want to argmax this quantity and you want to argmax this quantity to determine which action to take. So let's break down the intuition of like how you select actions here. This is the mean action value, so how good is a given child on average?

Um and and and if you actually you know knew the whole tree, then uh this is all you need, right, to select the best action. You don't really need to do more than that. But if you're interactively building this tree as you're um figuring out what the Q values should be, then what you actually

have to do is occasionally try some other actions, you know, as a sort of explore versus exploit trade-off. So in both UCB and uh pucked, there is this term here that basically rewards taking actions that you haven't taken before. So as we mentioned before, each node stores the visit count of taking that specific action. So everything is initialized to zero. And so for a given action, let's just say like call it and like action A, um initially it's zero.

And so as n is increasing, if if let's say we've already made 10 uh action selections from that root node, uh, but we haven't picked a yet, then this term actually starts to become quite large for a. And conversely, if we have chosen A ten times out of ten, then now this term is quite small. It diminishes very quickly. And the same thing is actually true here.

C

I'm just gonna make sure I'm understanding it. Uh uh you can think of it conceptually as two different things, the Q and then this exploration term. Let's just be clear about what Q is. Q is basically saying, hey, once we do these rollouts, so you're actually running all these simulations, you go down the tree.

And then you figure out, okay, if I end up at the terminal value of this tree, do I win this game or not? And then you do this, you average whether I win this game or not across all the n you know, the leaves of this tree. fr starting from this node, that average you put in Q. Correct. And so you're saying the Q is basically representing will I win this game or not? Uh what is the probability I'll win this game starting at this node?

That's your sort of um that is your sort of exploit. That is like saying th I've run these simulations, I think this is a good move or not. And then this other term is saying. Um have I explored this branch enough yet relative to the other actions I could be exploring? Or I have already explored. Maybe I think it has a low score, but I just haven't explored that many branch uh leaves of this uh down this.

uh leaves down this uh down this node in this tree. So I should maybe like try this even though the Q, the sort of exploit, is telling me that this is not that valuable. And so you because ln of n grows slower than N uh basically as the over time

you will move from the argmax being dominated by this exploration term, which is the second term here, to the argmax being dominated by the Q term, which is like, okay, I've done enough simulations. I'm quite quite confident that like this is the branch to go down.

B

Yes, that's right. So um the motivation for UCB was to come up with an algorithm where if you don't know the payoff of the arms, uh the the different actions you can select to begin with, this uh strategy Basically, with given some exploration term here, bounds your regret in terms of how wrong you can possibly be. I don't know the proof.

I don't also know if this one is proved to have a logarithmically or or like uh you know square root bounded regret or anything, but I think the algorithm was just derived to look something like this. And you can tell that these terms are a little they grow a little bit differently, and this is actually just to account for the fact that Go has many more actions in every given move compared to your standard bandit problem.

So uh one small clarification to make is that you talked a little bit about simulations on and probabilities and so forth. Um we should remember that Go fundamentally is a deterministic game. So the notion of pro like where does the notion of probability come from here, right? Um Uh if you had a very powerful computer.

There is no probabilities. You just you can just compute the true average of what the the mean action value is. So where does the probability come in? Well it turns out that um uh as in you know computer go before uh alpha go. We've always done some sort of Monte Carlo method where we have some we we take the um expected Q-value averaged over a randomly selected tree. Um and that randomly selected tree is where probabilities come in. So the interpretation of Q is

A

Um

B

What is the expected action value under the um under the random distribution induced by some random search process?

C

Mm.

B

And so where does the random search process come in? That's where uh you know P of action comes in. So if we assume a very naive algorithm where you have a uniform probability of taking any valid action, then this would just be one over the number of valid moves in uh in the in the system. Setup, and you would be kind of taking this average over this very diffuse tree.

Uh this is a valid um integral you can take, but it's very slow because you're gonna consider a lot of trees that have very low value. And uh it's essentially almost like a important sampling problem where you want to there's only a few Actions and tree and and sort of uh paths that can contribute, you know, high value, and almost everything else is low value. So that's a sort of a tricky um problem here.

So, this is the action selection criteria for how you decide which moves to move down. Now, as you move down in TreeSearch, You will eventually run into a node where um it's quite clear you've won or lost, right? At the at the very, very end of the game when when there are no m valid moves to play left under under Trump Taylor scoring, you can decide whether you like, you know. won or lost, right? So you you either win,

And so this is basically um you know the the final return of the whole game. Right. Um and so the the question here is like we we can assign a um a value u. to a t a terminal leaf node of the tree, but how do we assign the values for nodes prior to that, the parent? And it turns out, um, you know, what you simply do is you just take the um your your mean action value is essentially your average. So let's suppose these were leaf nodes. Um sorry, th these were all leaf nodes.

the the mean action value of this node, you know, this action here. is just the average of whether you won or lost at the at the leaf note.

A

And

B

Uh correspondingly you can kind of walk up the chain and say like well the mean action value of this node, let's call this like QB, and this is action B, is just the average of uh a weighted average of these ones here. Yes. Right. And and the weighted average is um It could be dependent on if you have a different sampling distribution or not, but the that the basic intuition is that you want to resolve the game where you have a deterministic win or lose, and then you can kind of

Go backwards, uh, this is called the backup step and assign values to these uh these these these nodes or actions um uh corresponding to the averaged over over the final terminal. Okay, so um if you were to do this without neural networks, it would still be intractable. Um you would s you would have trouble finding, you know, which

um actions to sample. A lot of the actions would contribute very low value, especially if you're like you know trying to fight your way out of a losing position and only a few actions give you high value. So the search in practice is still very, very expensive.

The idea is that like if you can, because Go follows a tree structure, you can actually, you know, inform a very good estimate of the value of this node based on the uh values of uh uh you know downstream, assuming they're all correct and assuming you've searched deep enough.

C

Mm. Your explanation earlier about the um these sorts of states where it's obvious to a human who's gonna win, but it's not obvious to or like you s deterministically you still had to play it out actually drove from the intuition of why the value function both is trainable and to why it's necessary in order to actually ha be able to learn this game effectively. And maybe it's worth defining value in the first place. Sounds

B

Yeah, so we we talked about uh you know this U value being you know your final resolution of whether you won or loss. And this is the terminal leaf node condition. Um

A

Now

B

Humans don't play all the way to the sort of edges of the the tree, the leaves of the tree, right? They kind of stop, you know, some dozens of moves before, maybe maybe even a hundred moves before if in in sort of high level play. So how do they know, right? Like you can think about humans as implicitly having a neural network called a value function that basically takes in uh abort state and then kind of evaluates um you know He went.

And so the human glances at the board and they know like I'm probably gonna lose. And and they're essentially running a neural network that looks at a board, and implicitly they are amortizing a huge number of possible game playouts.

and and taking that average and then deciding whether the board is winnable or not and then whether they should concede or or you know keep playing or not. And uh this is remarkable. If you think about like the um the beauty of something like this. It's like a a neural network in a in a human can somehow

do all of this simulation at a glance and then just know like within a few seconds, without actually playing every single game logically, based on just kind of like crystallized knowledge and experience that like they can do this. And so this gives us a hint that like In games like Go, um, there are ways to basically radically speed up the search process. And this is one of the fundamental.

intuitions behind why AlphaGo works is that you can train a value function to look at a board and quickly resolve the game without playing out all of these trees into the you know into into a very deep search depth.

C

Yep, makes sense. I I will say for the audience, um I sort of found uh For previous episodes when I was prepping and it w seems somewhat relevant to understand how AlphaGo works, I would find it very, very confusing and but it's the kind of thing where once you understand the problem in this way and then you'll build the next few pieces. it is actually much more understandable and it will make a lot of sense.

And it's okay to be confused right now, uh, but it's it's probably simpler to understand by the end of this lecture than you anticipate. So I'll just make that note for the audience.

B

Uh yeah, the important intuition I at a high level, just to you know st step back about where we're going with all this, is that um classically for games like Go, you could build a tree, but we don't have computers powerful enough for that. Yeah. And um Estimating the value of every action that you could possibly take is also hard because you don't know until the end of the game.

You could take averages by playing them to the end, but that's also hard because you don't know which actions to take to sample these averages. So conceptually there's kind of two problems. There's the breadth of the tree and then there's the depth of the tree. And AlphaGo gives us a way to basically uh shrink both of those to be very tractable. That's essentially the kind of core idea behind it.

Okay, so we uh we take this idea that like you know humans can glance at a board and instantly predict whether we win, and maybe that gives us the opportunity to really truncate the how deep we we search. And then you know we also know that humans can look at a board. And um and uh decide, you know Um what what boards, you know?

like intuitively at a glance, what moves might be good on a go board. So these are kind of two things that we can use deep neural networks for to accelerate this search process. Um let's go back before we talked about neural nets, let's just go back to how this playout works. We've only talked about making one move, right? So so the AI looks at this encoded go board, it has a tree, um, it searches

for you know di deeply into the tree to find out which of its actions might be the best. And then it takes that action. And then now, you know, it goes back to the human. So maybe that now the human sees a a Go board that looks like you know like this. And um and then they um they make their move. So maybe they put um they put their stone here. And then now we Um we go back to the AI. Which now looks at a new encoded board.

🔇 Silence

B

So I've used two to denote the AI's playing as white and one to denote the human playing as black and zero as empty. And then now on the AI's turn, it does the MCTS tree search all over again from scratch. So it throws away this old tree that it searched last round, and now there's a new root node and it begins to search a new. And then so on so forth. So MCTS is basically a you can think about it like a search algorithm that is um deciding what moves to play best, aided by neural networks.

Um and and it's it's done on every every mood. Okay, great. So let's talk about the neural network part of this.

C

And while you're racing, uh another sort of thing that was important for me to understand was This MCTS data structure with nodes and children's of nodes and whatever, um, this is done per move and reinstantiated once a move is made. So A a human makes a move, then the AI looks at this and is trying to basically run a bunch of simulations.

uh to figure out okay what should make move should I make next. And those simulations just a simulation is basically like exploring one more node in this MCTS tree. And at the end, um once all these once all this you know you run a thousand simulations. That informs then this um, I guess as you'll explain, this probability of what move to make next.

That's what you store. You you sort of choose the best move given those probabilities. You discard all of that, then then the next player makes a move and you restart this process at the beginning of every move. Correct.

B

One small addendum, you don't discard all of that. You keep one thing behind that we'll use later.

C

Just like I did for Reiner, I wanted to make flashcards for this episode so that people could retain these concepts. And ideally, an LLM could generate some candidates for me to then refine. But to actually get high quality suggestions, I needed to design a whole pipeline where the AI could take and ingest screenshots of the blackboard at the right timestamps.

and then make SVG diagrams in case visuals were helpful, and then run their writing and drawing through a critic, and then revise the card in response to this feedback. It's very hard to accomplish this just by sacking LLM calls. This sort of step-by-step recipe works much better if you have a durable agent that's been engaging with the task across all the previous stages.

So I used the cursor SDK to spin up an agent for each card. The cursor hardness saved me a bunch of work in designing some custom context scaffold or figuring out how to design tool calls for taking screenshots or making animations. These agents all run in the cloud, so I don't have to worry about leaving my laptop open. I just get an email when I have candidates to review. You can check out my cards at splashcards. You can start building with the agents SDK at cursor.com slash smart.

What the neural network does

B

Okay, so now we have a basic intuition of how moves are made with search. We're gonna talk about how neural networks can speed this up by providing an analog to like the human intuition here. So there's two networks. There is the value network. Which takes in a state and it predicts, you know, am I gonna win or lose? It's a binary classification problem. Then we're gonna have a policy network, which

Uh induces a distribution over good actions to take. So um I'm gonna draw a one-dimensional flattened move distribution, but this is really like you know a a a square kind of grid, right? So um so maybe like it thinks Th these are the kind of probability distribution over good actions. And both of these are uh categorical classification problems, right? So you can train this like any classifier in uh with deep learning, um uh you know, cross-entropy loss.

So the specific architecture does not actually matter too much. I tried a few different architectures, transformers work. ResNets work. For small data regimes, um my experience is that ResNets still kind of outperform transformers and um and kind of give you more bang for the buck at at lower budgets. But this may not be true.

Um they they provide the inductive bias of like local convolutions. And generally transformers start to outperform uh residual convolutional networks when you want more global context. I see that. So one um interesting finding from the Catago paper was that they found it actually quite useful to pool together global features together, um and aggregate global features like uh throughout the network.

to kind of give the network a global sense of how to like connect value from one side of the board to another side of the board.

C

Well what does it mean to aggregate global features?

B

Yeah. So if you have a um go a a very large 19 by 19 goat board and you you know you've got some some sort of battles going on here, then you've got some battles going on here. Um when you pass this through a convolutional neural network, the receptive fields of the convolutional network are going to be good at computing local things and making that invariant.

But um they won't be able to kind of connect these two features easily, right? They need to sort of be pooled together and attend to each other somehow. So the argument about why transformers are good for computer vision tasks, like with uh you know vision transformers and so forth, is that because they have uh sort of global attention across the whole thing, they can more easily draw these connective. But you do need more data there so that you can kind of uh

Learn through data the the sort of invariant local local features. I've tried very hard to make transformers work for this problem because I was kind of curious if transformers would Present some sort of breakthrough in Go and just remove a lot of those tricks. But try as it might, I actually haven't figured out a way to make transformers better than resonance for now.

C

I it it makes sense why transport immers with their like a global pooling of information would be better if you need to consider information that is not just spatially um Uh or yeah, uh CNNs give you a sort of bias that the the things which are next to you are especially irrelevant.

B

And then they're sort of aggregated up.

C

Yeah, exactly. Yes. But supp okay, so for games where it i it isn't that relevant what is happening locally, you just kind of have to consider the whole thing. You're saying transformers would work better. How about games where so the particular talking about the spatial dimension. How about the temporal dimension where right now we're only considering the previous move because it is a deterministic full information game where um uh but what if it was something like poker or diplomacy where

R really a bluff they made a while back is sort of relevant to understanding now and isolating to m decide to make your next movements. You need to consider all those previous states. Would that then change the consideration of what inductive bias is most relevant and which architecture is most relevant?

B

Right. Great question. So Go is a perfect information game. Yeah. And in perfect information games, um, there does exist a Nash equilibrium strategy for which you can do no worse than any other strategy. So um if you know that your opponent has a particular bias, like they they love to play aggressively, you can actually in principle counter that specific strategy better than a national equilibrium policy. But um to counter any given strategy, um, there does exist a single

um national equilibrium that can be decided solely using the current state. So um that that is a design choice that most Go agents uh Alpha Go chose to do, which in hindsight turned out to work very well because the uh Nash equilibrium seems to be superhuman. Like uh like no human strategy seems to be able to beat it.

Now, there are variations of this where you would actually need to consider temporal history. So and and this is a very exciting research area that I I would encourage people to kind of fork my repo and try these things out, which is um if you were to play, let's say, 2v2 go.

then you actually need to model your partner's uh behavior and you like you may not have information on how they play, so you need to aggregate some information on like how they play so that you can respond accordingly. Right. Like these are s uh situations where it's no longer a perfect information game. And then in those cases in in games of imperfect information or partial observability, then you do need some context to build a model.

And and and I think that's a a place where things get very, very exciting in terms of like self play or you know, diplomacy style. Yeah, interesting. Okay, so uh returning back to the neural network, the architecture again is not super important. You can get it to work with transformers, you can get it to work with resnets. I found that for low budget experiments, uh ResNets work a little better. Um you can also use kind of a Carpathi-style auto research.

hyperparameter tuning to make make your architecture pretty good. And so so you don't have to worry too much about that. You just need to sort of set up the problem so that you have a a sort of target optimization. Cool. Yeah. Okay.

So we're gonna pick just a somewhat arbitrary architecture that worked for for you know what I did, but again, this part is not super important. Um you have your encoded board state, and um we're gonna just choose to let's say do Three uh three like you know similar to an RGB, we're gonna have three kind of channels. Uh one channel to include black, one channel to include w white, and then um and then uh one channel maybe to encode like um uh empties or um

Maybe like a masked region if you want to train on multiple board sizes. I'm actually not going to talk about multiple board sizes for now. That's a little bit too complicated. So we'll just say, like, you know, we've got the Two or three channel uh RGB-like image, and then we go into a you know a ResNet.

🔇 Silence

B

And then we have two branching heads. Um one head predicts the the value function, and this is like um a single logit. So it's like R1. And then we have the policy, which is you know R361. So, This is the architecture, and uh we're going to basically train this to predict the outcomes of games.

given the board state. And we're also going to train this to predict what are good moves. So the OG AlphaGo paper, uh or called AlphaGo Li, um initialized this network with a supervised learning data set of expert human play. Later, they remove this restriction by having the model teach itself how to play well. But I find it actually from a matter of like implementation for your audience.

Super, super nice to always kind of initialize your your experiments to something that's easy and then like, you know, get the problem working. before you know trying to bite off the whole thing and learn and tabularize it. You you generally want to kind of initialize just as in deep learning, initi initi initialization is everything, right?

Initialize your research project to something as close to success as possible, especially if you're you know doing something new that you haven't done before. Like always pick something that works and then get it to do something better rather than start from something that doesn't work at all and then you know um try to make it work. So

Um under that philosophy, it's a great idea to start with something that like you know has a good initialization. So we're gonna take human expert plays um and train this model to predict um you know good actions, right? So we're gonna take all of the winning games, uh all the moves in which a a human one and um uh sorry an expert one and then predict those actions.

And then uh regardless of board state, like you know, whether you won or loss, you're gonna predict the outcome. So you might be wondering like, okay, well, some of the early boards, you know, where it n basically only one stone has been put down. How could you possibly know whether who who the winner of this game is, right? Well if you have uh you know hundreds of thousands of games, then in on average you'll probably see that boards that start like this.

have a sort of half of the games that branch off from this will win and half of the games that branch branch off of this will lose.

So that'll actually be fine. When you train this model to predict those, the logit will sort of converge to uh you know 0.5. Um and and so so for these for these things, it's it's sort of expected that once you Train the model, a starting board state will look like 0.5, and then as you progress towards the end of the game, it'll actually look something like, you know, if this is 0.5, the the win probability will sort of either go like this or it'll it'll go like this.

And um and and this is sort of your move number? Yeah. And so as you know get hundreds of steps into the game, it becomes much more clear who's more likely to win or who's more likely to lose under your expert data distribution.

C

I I didn't understand the significance of why the this way of thinking about value is especially relevant to the expert data.

B

It is not relevant to the expert data. It's true for any data that you train it on. Yeah. So if you were to learn a tablet Raza, you would also expect this to h fall out. If you just do this, like so imagine you know you're vibecoding AlphaGo and you um you you gather some expert data sets from like HottaGo online, um or you you know you have a data set of human players and you train this model.

Actually, it turns out this model is already a pretty good Go player. It'll most likely beat most human players, right? So so like if you just take this policy recommendation and take the argmax over its uh If this is the you know pro probabilities, if you take the argmax and you just take this action as your go play, um it'll be a very, very fast go player that doesn't think in terms of like reasoning steps, it just kind of shoots from the hip, and it'll be a very strong go player.

Which is already quite miraculous if you think about like, you know, ten neural network layers, maybe under like three million parameters, can already do something that impressive.

And so you can start this way, and it's important when implementing this to kind of just verify that this is probably true. It's good to verify that your Go rules are implemented correctly, that like you you know you you can run these simulations relatively quickly uh w and and just as almost like a a sort of a Um a checkpoint that like you want to make sure that you can actually do this basic step before you try to layer on more complex uh things like search.

So but yeah, we can do a lot better than taking the raw neural network and playing the moves. And uh this is how we can apply it to Monte Carlo TreeSearch. So let's uh apply the neural network to um to improve Monte Carlo TreeSearch. So we start with our root node.

🔇 Silence

B

And we now have a four-step uh iterative process to do MCTS. So this tripped me up when I was first reading the paper and trying to understand it. But um uh essentially what we're gonna do is we're gonna choose a number of simulations. So like you know, num. Simulations.

And this number varies. This can be, you know, somewhere between two hundred to uh twenty forty-eight. I believe in um in the AlphaGo Lee match, they used tens of thousands of simulations per move because they really wanted to boost the strength of the model as much as possible. But in training you don't actually need too many. And Codigo, I think, uses something on this order as well.

C

They used uh if you watch a documentary, they had a laptop out during the game. They didn't use a laptop itself. It was like on some

B

It was on some TPU pod, I think. Yeah. But now...

C

Lee is not using like one E twenty two flops to do a move, you know.

B

Fair enough. Um Interestingly enough, modern Go bots don't need that much compute at test time. And what we'll actually find out as we talk about how the MCTS policy improvement works is that over time, the raw network actually takes all of the burden of that big TPU pod and just push pushes it into the network. And you can do actu all of that work with one neural network forpass. But but the TPU pod will always add the extra oomph on top. And so that's what they wanted for the Mac.

So we're gonna pick this kind of like num simulations thing. And uh for every simulation, we're going to basically do several things simultaneously. We're going to see which which moves are the best in the current tree. We're going to add extra leaves to the tree if we get to a point where we need to add a leaf. And we're gonna update the action values for for the tree. So that's that's what every every simulation involves these kind of like four-step.

So the four step process is basically selection. Um expansion And backup. So so at the beginning of our Monte Carlo tree search, our tree is very uh basic. It only has the the root node or our current board that our um AI wants to play at. And so We're going to basically select the best action for this. So when this root node is created, we also know that we can evaluate this under our neural network and get the quantities V theta as well as our probability over actions. And I'm gonna say root.

So for all of the actions here, we can create a bunch of children, right? So so this one has um well in this case I'm drawing a three by three board with one one board missing. So basically there are um you know eight possible children uh associated with this root node.

🔇 Silence

B

And each of these has an associated probability of taking that action. So there's P8, P1, P2, et cetera. Okay, so at the beginning of our Monte Carlo tree search, we have our root node, and we can initialize it with some children, right? Because we know it's uh the the policy network evaluated on the root node gives us on a three by three board with one existing uh stone placed, eight possible children that this

AI could take. So with each of the children, our policy network also gives us the probability of selecting that child. So the first step is to do the selection of the tree. And again, this is a very shallow tree. All we have so far is a tree of depth one, essentially. So our first move is to select. by maximizing or argmaxing the pucked criteria, which is basically, you know, Q um Q S A plus um C puck.

times P of A divided by N over one plus N A. So for each of these, we're going to uh you know NA is zero for for all the actions initially, n is zero. And um and so we're going to basically just you know pick according to this. Um initially Uh what is going to be the you know chosen action here is most likely going to be biased towards um you know the highest likelihood action here, right? Because these are sort of uniform for everybody. So Um let's suppose P1 was the highest probability node.

So you you you selected this one here. Now you got to this node and you realize that it's not a leaf node, right? There are more game it's not a terminal game, so you cannot resolve the the final resolution. So you the next step that you do is um expansion. So you um you will then run this node, this board state, through the policy network. Note that this is the AI's move, right? Like a AI is making this move.

And so when we expand this tree, we're now thinking about what the human might do, or any opponent might do. So this is like your opponent. The tree expansion process actually is completely s uh so so so when we evaluate the um the node here, we're gonna now evaluate the the node from the perspective of this player. Um so then this one has possible actions that we could take.

And uh we we expand basically the the leaf nodes here. So for each of these nodes um that we could you know arrive at, we're gonna now check how good those nodes are. Right. So so maybe um From here, like the human could play here, the human could play here, or human could play here. And we're gonna um store essentially the V theta for each of these things. So V theta of you know node one. Or like node one um prime b theta node one prime.

🔇 Silence

B

Um and and so we're we're basically using our neural network to make an intuitive guess of how good is this um board from the perspective of this player. And uh fortunately, because the uh it's a zero-sum game, it's easy to deduce that you know the value for this player at this this step is just one minus the value for you know from this perspective. So it's easy to flip the search process uh depending on which player you're at.

A

Um

B

And so this is the expansion step. You've taken a non-leaf node and expanded it and evaluated the value. And this is essentially a quick guess as to like, If I were to play to the end, am I going to win or not? So you can almost think about the V theta as a shortcut for searching to the end of the tree for any given simulation.

A

Um

B

And then we're go and this is this is essentially the evaluation step. We're evaluating the quality of each of these ports. In original Alpha Go Lee, they actually did something uh kind of interesting, which is that they took this value and they averaged it with the value of a real Go playout. So they actually played a real game from here all the way to the end.

So like I'm just gonna draw this squiggly line to indicate some path And uh they kind of like play this all the way to res Trump Taylor resolution. of a full board. And so this is like a zero or one, right?

🔇 Silence

B

And so they took this value and they just averaged it with this one here. So that the the the the formula they did was like uh you know alpha times v theta of of like you know some some node um plus uh sort of like one minus alpha of a pl uh of a true randomly sampled plant.

🔇 Silence

B

And you might be wondering like, okay, well, how do they play this out, right? Like it would be very, very costly to do another search on on this play out, like almost like a tree within a tree. So they don't do this. Instead they just uh take the policy network.

and play it against itself. So they just take this as both players and they just play it all the way to the end. And and um this is something that helps ground the um the estimates here in in reality because you can get a single sample estimate of like whether you win or not.

You can think about in the end game where the board is almost resolved that this one actually becomes quite useful because the random the the the play according to the policy will most likely decide a pretty reasonable guess of the game. And so you're not facing a problem where this one kind of becomes untethered from reality.

It turns out this is totally unnecessary. So in all subsequent papers after AlphaGo Lee, they just got rid of this. In my implementation, I also did the same and it speeds things up a lot because you don't have to roll these games out on every single single.

C

Okay, so uh again just to uh m uh uh reinforce my own understanding and just to re-explain it. For the items, by the way, uh in case it's not obvious, the P there in the select that is the probability coming from the network. In this case. Correct.

B

The policy network here.

C

Okay, so fundamentally the i a simulation, just think of it as like rolling out one more node in the search process.

B

Almost. A simulation is easy to think about when the whole tree already exists, right? Walk down the tree um using the pucked selection criteria, and you you uh and then and then you keep going. Now uh In AlphaGo, the d the data structure is such that we begin with a tree that has no like basically only depth one, which is its only children.

And you want to iteratively build out the tree as you're also selecting actions down the tree. So that's the kind of core thing here is that because Go is such a combinatorily complex game, you cannot afford to build the tree in advance and then search it. Search while building the tree.

C

Okay.

A

So um

B

Let me just finish up with actually the last step, which is the backup. So once you've scored these things, you basically take the mean, the the value, the the Q value assigned to the node here for taking this action, is now just the average across the. your evaluated values here. It's uh you take a running mean over over uh all of the um the the simulations that you've taken and they average the values of the children's notes.

So that's what is known as the backup step. And once you evaluate this, you can actually kind of recursively go back. So if you know the you know the action value of this node, you can then take the average on its parent and so on and so forth. So so you have this kind of four-step process where you are

Choosing the best action that you know of so far, then you may run into a node where you uh you you haven't been to before, so you need to grow the tree a bit, and then you run it through the network to guess whether you're gonna win or not. And then you walk all the way back up to the to the root node to update your values on what the best moves are.

So as you do this iteratively, this selection criteria will cause you to visit the but because you're always selecting it according to this criteria, you're always going to be selecting the best action you think at any given branch. Right. So so the final um visit counts of like how often you chose these things will reflect your correct policy distribution as induced through this search process.

Um and so so the visit count that we store in the node earlier actually becomes the sort of vote for like which way we should finally select an action here. So um, you know, as a sort of test of understanding it's worth thinking a little bit about Whether we could make this even simpler, right? Like could we actually maybe even get rid of this one and still make the thing work?

Recall that, you know, when you do an expansion and then an evaluation at let's say this node, you are you are checking the sort of win probability of each of the child nodes, right? And so if w this one is you know like one and these are zero. um you do kind of know something about which action might be better to take. And so why would you need still need this, right? Like why not just uh normalize this one into some distribution and call that your policy distribution?

Um this is fine, you can do this, and um this probably does work, but in practice having a single forward pass that gives you a pretty good guess is um is how the uh the breadth is is uh is pruned out. Um The there is a sort of duality here. Like it would be weird if, let's say, the policy

recommended an action that disagreed with the value. If if let's say a policy said this was very high probability, but this one said it was a you know low value, then there's actually something kind of fundamentally wrong between between your policy head and your value head. So they are linked, and uh you probably could get rid of this if you came up with a different way to recover this from just the value evaluations. Right.

C

Just to make sure I understand. The reason you don't do that is so that you don't have to do three hundred and sixty independent Uh forward passes to like, okay, here's the value of everything, let's talk max over it, right? Instead you can just do one forward pass and get like the probabilities of all of them.

B

Um you can usually batch these somewhat efficiently. Uh so it probably is not a huge computational burden in practice. But yes, you would have to pass 361 board like up to 361 boards into a single mini-batch update to evaluate all the values here, then normalize them. Um now there's actually a a more uh important reason why we still do this, which is

how Monte Carlo tree search is used to feed back on itself and and sort of recursively improve its own predictions and search capabilities. And that's where this this one, having this as an explicit entity you're modeling rather than an implicit normaliz normalization over your value is is a is a good idea. Okay. Okay. So um so we we talked about the simulations, and basically, you know, what you end up with as you roll out the number of simulations is a tree that kind of looks like

I'm I'm I'm drawing a very low-dimensional version of this. Of course, it's in in in the real game, it's much more high-dimensional. But like you you'll end up with basically a tree structure that like has um A lot of leaves that kind of terminate and are not visited again because their value is deemed to be too low. But then you know, along one path there will be a set of actions with very, very high visit counts that kind of gravitate towards that one set of decisions as you increase uh n.

So this is kind of like the mental picture of what the tree in Monte Carlo TreeSearch looks like. And you should contrast this with like an exhaustive tree like in Tic-Tac-Toe, where you could say like, you know, there's there's nine actions and then eight. And then seven and six. And so it's a sort of like nine factorial sized um tree. Um the Monte Carlo tree search in Go is very, very sparse, right? It only considers the paths that you've expanded child children's nodes.

Okay, so um now that we have the search algorithm that applies the value function as well as the policy function. Um We can now talk about how the Monte Carlo tree search algorithm can actually act as a improvement operator on top of these guys here.

C

Twenty years ago, Jane Street's data center fit in the corner of an office. Ron Minsky, who co-leads the tech group there, told me about how it all got started.

D

One of our compute clusters we called the Hive, and I remember the first version of the Hive was literally like six Dell boxes stacked on top of each other at the end of the row. And the trading systems themselves we also had there because we actually wanted the ability to make sure we could turn the damn thing off. I mean there were ups and downs, like literally at some point

You know, one of the people who was cleaning the office unplugged one of the trading systems in the middle of the day as they were vacuuming. So, you know, i in the end it is in fact better to have it all in a data set.

C

Gene Street Stata Centers have come a long way since those six DELS. And I got to tour one of them in Texas with Ron and Dan Fonticle. who leads James Street's physical engineering team.

F

You know these cabinets, these GP300 cabinets, consume at peak about 140 KWE. Compare that to tr traditional air cooled, you're talking about 10 to 40 kW? It's a lot more.

C

deep into the details of running one of these data centers, things that I had never considered before.

F

It's filled with a liquid, uh a mix of distilled or deionized water and uh propylene glycol, twenty five percent of propylene glycol. Um that's to inhibit any bacteria or algae growth.

D

I don't love the world where we have to worry about bacteria growing in our service.

C

I got to see way more of what actually happens in the data center than I've ever seen before. Jane Street was willing to literally pull up the floorboards and take out the racks and take me to the vacuum. where all the chillers are. You can check all of this out at janestreet.com slash tourcash where we posted the full tour.

Self-play

B

Okay, so um we now talk about the RL part of like how this thing gets stronger by playing itself, right? Um Let's say we play a game where you have the AI, so you make a move. AI AI um will will kind of Compute the search and then this is this sort of visit count distribution. Um let's say this is your policy, your policy, initial policy recommendation at the at the at this node. And then after MCTS.

It uh gets more confident about one of these actions, right? And and so maybe the the distribution looks a bit more peaky like this. Based on the search. Now, of course, you can tune the search process so that it ends up more diffuse, but that's probably not a good idea. MCTS should get more confident about specific actions.

Than others, but it of course might place a lot of weight on other actions initially, and then as you increase this number of SIMs, it should converge to a very peaky distribution. Um so this is your new uh let's call this like Let's let's wrap this in like a MCTS operator. of you know a given s right. After applying MCTS process, your your policy recommended distribution looks like this. It's a bit more peaky than than the previous one.

Um and so then you take the uh arcmax, or maybe you just sample from this. Uh it doesn't have to be the arcmax, and then you you make your move. And then um and then you throw away the tree, and then you you begin anew on the next move, right? So Again, like you um you know you compute a new distribution.

🔇 Silence

B

So initially maybe your guess looks like this, and then you refine it through MCTS.

C

There should be one more X on the board, right?

B

I'm sorry. That's correct. Um to something that looks like Right. So on every move, you have your initial guests from your policy network. Um and then the search process that combines your policy network and your value network arrives at a more confident action that you take. And and then so on so forth. And then the game ends and one person wins and one person loses. So a um The way that uh the the beauty of of how AlphaGo trains itself is that it actually can take this.

final search process, the outcome of the search process, and tell the policy network, hey, like, you know, instead of Having MCTS do all this legwork to arrive here, why don't you just predict that from the get-go? Why don't you like you know not use this guess and just predict this to begin with? And if you have this guess to begin with in your policy network, then MCTS has to do a lot less work to get things to work.

And so if we draw like a sort of test time scaling plot, um, so so let's say like this is like number of simulations. Um let's say, you know, at at zero simulations your your sort of implicit win rate is like um is like, I don't know, here, and then and then um without any simulation, if you just take this raw action, that this is your what your win winning rate is. And let's say as we increase the number of SIMs, um maybe, maybe you kind of have a a win rate that looks like this, right?

So when you search for let's say a thousand simulation steps. That gets you to a um a policy here that gets you to here, which is great. But if you were to distill this MCTS policy network back into your sort of uh shoot from the hip policy network, then you could actually um uh you know start here. Uh if if let's say this was, you know, zero um for uh by by distillation, then if you spend another one thousand

then you actually kind of get to here. It's almost like if you could just, you know, am um amortize the the the first 1,000 steps actually into the policy network instead of the search process, then you could begin at a much better starting point and then get a much better result. For for your uh for the number of sims that you put.

C

Sigmoid type nature of test time scaling as the number of simulations increases, the the increase in win rate is uh smaller. Is that true even for the distilled network? That is to say, is there some gain of like, okay, we start from the distilled, we get these early gains again, or is that just inherent to like the nature of

A

Yeah, MCTS.

B

To be honest, I actually don't know the test time scaling behavior of MCTS simulations. And I I believe it might actually be quite sensitive to how strong this one is in practice. I'm just drawing a monotonically increasing function that gets to one. Okay. Yeah. So don't pay too much attention to the shape of the curve. Just know that it's monotonic with respect to.

Okay, so um so so the idea of MCTS is very brilliant, which is like we're gonna we got something better by applying search. Yeah. And um we're going to now on our next iteration of updating this network. Just train this to approximate the outcome of a thousand steps of search. And so instead of starting here, we get to now have our neural network start here and then and then you know the the play gets stronger once we then apply another thousand steps on top of it.

And you can keep going, right? So the training algorithm for AlphaGo is to basically take the games where you've applied the search on every move that the policy encountered. Uh whether you w won or lost, and that that's quite important, um, and you're just gonna train uh the the model to imitate the search process. So um there's a analogy to robotics actually, which is the dagger algorithm. Um uh first I'm gonna draw like a uh schematic of like let's say you know the states, right? So s0.

S one S. S3. So let's say you know we t we took a series of actions in an MDP to uh And these actions may be suboptimal, right? Maybe we lost at the end of this game. There is a family of algorithms that basically take trajectories and relabel the actions to better trajectories. So maybe a better action here would have been to take, you know, A0 prime, a better action here would have been to take a one prime, and yet another one like A2 prime. A three point.

What MCTS is doing is basically saying like you play this game where you eventually lost. But on every single action, I'm gonna give you a strictly better action that you should take instead. It does not guarantee that you are going to win, but it does guarantee that, you know, if you take these So that you retrain your uh your your policy network to predict these ones instead of these ones, you're gonna do better.

And uh this is very related to dagger in uh robotics and and uh imitation learning, where you want to collect a intervention here. And even if you're in a in a not great state, for example, like a self-driving car that you know veers off the side of the road, there is still a valid action that kind of corrects you and brings you back.

C

Okay, so um Pedantic question. But is there a guarantee that MCTS must be better than the policy? For example, you could imagine early on in training, because MCTS is informed by the value network, early on in training uh when the value network hasn't been well trained on finished uh games. Um that like MCTS is worse than sort of randomly initialized policy. So is it just like a heuristic that MCTS is better than the policy, or is that like is there some guarantee? Right.

B

In practice, it is a heuristic. Um, and it does work in also in practice. But let me illustrate a example where MCTS can give you a worse distribution than your policy network. So um and this can often happen if your self-play algorithm um has trained to a good point, but then somehow it's uh it it collapses because th it's um it's not trained on diverse data or something, right? So um let's say we have a board state where

The policy recommendations here are very good. So like, you know, pi of as is like great.

E

All right.

B

But um somehow because maybe we're playing on a lot of games where the the bots just resign instead of playing all the way to the Trump Taylor resolution, they kind of forget um how to evaluate those kind of late stage plants. Like in the in the case that we showed with the corner play, maybe like a hundred percent of our training data in our replay buffer has lost examples of how to evaluate the value function at those stages.

So you might end up in a scenario where your terminal value is like very bad. if the terminal values of the leaves are not good, then this will actually propagate all the way up and cause your um your your pucked selection criteria and your backups to be off. And then you'll end up visiting a very, very different distribution than what your policy initially recommended.

Um also if your number of SIMs is low, then you might also have a variance issue where you just don't explore enough. Like like uh it's only guaranteed to converge when you kind of take n to infinity. Um

A

Ciao, ciao.

B

variance in, you know, your search process as well as inaccuracies in your evaluation can definitely screw with the quality of your policy network recommendation. And so that's why it's not a guarantee to improve. But and and that is why I think I I suspect why AlphaGo Lee had the uh playouts to the end in their training algorithm so they could ground this thing in real playouts.

A

Ja.

B

In practice, what you could also do is just like for 10% of the games, you you prevent the bots from resigning and you just say like resolve it to the end. So you get some training data in your replay buffer to really resolve those kind of like late stage playouts that normal human players would would kind of uh not play too.

A

Yeah.

B

So um so this is why MCTS kind of, if you assume that the value functions are correct, uh why it gives you a better policy is because. Yeah, assu uh and it's a very critical, you know, chain of assumptions. Assuming that this is accurate, then uh your search process should give you a better recommendation than your initial guess.

C

Okay, so if you have a cold started policy. uh if you have an office zero type thing. Yeah. Really what's happening for the first few epochs is the policy is kind of useless and what you're really just doing is, hey, but let's play full games and uh once we have a played full games for the preceding moves We'll have labeled Who won, who didn't win. And the loss for Alpha Zero has two components, which is like how good is the policy relative to MCTS?

and how good is the value prediction relative to who actually won the game from this move. And this is this sort of like you can think of this a being applied to every single action or every single move. Correct. And really what's happening in the beginning of Alpha Zero training.

is just like we're trying to get the value function to actually predict who will win the game if you're if you find yourself in this state and you're this player, uh and functionally that's all that's happening. And later on, once that's well trained, now the policy is also improving. Correct.

B

One trick I did find to be pretty useful, and this isn't not a peer-reviewed claim, so just like take this with a grain of salt is like I I found it useful in my own implement implementation to do the do the following. You want to first make sure that this is good before you invest a lot of cycles doing MCTs. Like like it doesn't really make a lot of sense to do search on garbage value predictions.

Um so you want to kind of start at a good place where this works. The AlphaGo Lead does a very good thing where it just takes human games and then you you like uh train on it and it just works, right? Totally works. Um you can also take an open source Go bot, play it against itself, um generate data. Also works. Uh so so if you have some like offline data set that um that uh has realistic

good play, you can easily learn the late stage value functions pretty well. And that's the that's what you kind of need to start the search process.

C

Can you just read this and then?

B

Sure. So so um it it's quite easy to evaluate a late stage go go game. Like when almost all the pieces are on the board. Like it's almost like a decidable problem, right? Because there's the lower and lower uncertainty as to like the depth of the tree. So um most games played to the end by reasonable people will be good training data to train a good value function at terminal parts of the tree.

Then as you play more games, the um the the search will back up good values into the the sort of intermediate nodes of the tree. And then like as you increase the amount of data, your your value head gets a good intuition of like what is a healthy board state versus a not healthy board state. That that those are much more subtle to judge in the mid game than the beginning or the end. So the most difficult part to score is like

Not the beginning or the end. Because the beginning is just like obviously 0.5, and then at the end it's like pretty obvious who's winning. So the hard part that you want to learn in the value function is like who is winning in the middle.

C

And so you're d this is this is actually very analogous to T D D learning.

B

Yes. And there's a beautiful connection to TD learning that we can talk about in a bit, as opposed to, you know, contrasting with Monte Carlo Trice. So you first want to get good value functions, and expert data can kind of give you a quick shortcut. I recommend for you know practitioners just do that first just to you know initialize to a good starting point. And then if you want to do the alpha zero thing or uh or codigo kind of s uh table rasa learning.

Um then what you can try to do is on a small board play random games, just take a random agent. And if you play like

A

Uh you know.

B

fifty thousand games, you'll actually learn a pretty good value function as well. Because on a nine by nine board, there's actually um You can see enough of the common patterns with random play. And then if you train a model that kind of can train on both 9x9 and 19 by 9 data, um, and Katogo was a was a uh proposed one of these architectures. then um there's some pretty good transfer learning from the value uh value head evaluated at nine by nine to the nineteen by nine.

C

Right, because this uh this unlike other games has much like a very much a sense of like there's not like a new kind of piece that is introduced when you increase the size or something.

B

If we take it to its limit and consider like a very tiny, like four by four go board, like if you play fifty thousand games, you're gonna have a lot of end states that look like human play, right? Like it's just like tic-tac-toe at that point. Um so so if you like broaden this a little bit to like

It's not unrealistic to imagine that like purely random play will actually generate pretty reasonable looking boards. Um and then so you can score those pretty easily. And so that is what gives you the bootstrapping to be able to then improve your policy with security. But it's very, very critical that MCTS has accurate value uh estimates. And you need to ground the value. Ultimately, MCTS will fall apart if you don't have a grounding uh function for um for the value.

C

I I'd be curious w how how much compute you save by training the value and policy on the same network. That because they share the same representations, how much more efficient learning is?'Cause that would be interesting if um they're basically kind of

Uh we've just talked about how they're kind of making similar predictions or they shouldn't be in line with each other. Yeah. And so I'd be curious if like actually yeah, you just you're you're like having them on a compute you had to do by keeping them the same network.

B

Right. AlphaGo Lee, the original AlphaGo paper had two separate networks. And then um in all subsequent papers um they they merge them into two heads. And presumably this saves compute. But answering that question in a very rigorous scientific way is actually uh it's it's a simple question, but but in practice actually takes

Like if you really want to chase that question down to its its limit, it's uh it it takes quite a bit of work to you know really resolve that. Um but intuitively yes, they share a lot of representations. So so and you know, as we mentioned, there is there is a sort of like Your policy network and your value network when doing evaluation should kind of agree, right? So that there really should be this sort of consistency between them.

C

Yeah. I feel like when I learn how an LLM works and how simple RL VR is, at least the as an algorithm, how simple it is. I'm sort of stunned by the kinds of things it can do, uh, that it can learn how to build very complicated code repositories and whatever, simply from getting like a yes-no. And here, I feel like the m-if you understand it more deeply of like just predicting MCTS and

It actually seems AwfulGo seems less impressive in retrospect the more you understand it because you're like, oh, you're putting in a lot of bias by just saying how much you'd ex you're like telling it how it should titrate exploration as things go on. You're building this very explicit uh tree search for it. And so I don't know if you share that intuition or It actually the more you understand the less impressive the accomplishment in twenty seventeen seems.

B

I uh I personally disagree. I I think they're profound for different reasons. And I don't understand the LMRL like enough to like kind of comment on your podcast about it. Um but um I think AlphaGo so so yeah, why is it a profound accomplishment? I think maybe it's worth stepping back a little bit and just like uh it is different than modern RL and we can talk a little bit about like s some of the algorithmic choices there. But um I think the most profound thing here is that

A 10-layer neural network pass. So basically 10 steps of Ten steps of reasoning. Yeah. And of course the reasoning is not just one trail of thought. It could be like the distributed representations and a lot of thoughts going on at the same time. But uh by construction, let's say a 10-layer neural network can only do 10 sequential steps of thinking, right?

um ten steps of neural network paralyzed distributed representation thinking is able to amortize and approximate to a very, very high fidelity a nearly intractable search problem. So so this was a breakthrough that

I think most people don't even understand today, like fully comprehend like how profound that accomplishment is. And and this is what also GERDs like um Alpha Fold, for example, right? Like um uh where you have a very, very difficult physical simulation process that you would need to roll out so many micro scale simulations, and yet like ten steps of a somewhat small neural network can

somehow capture what feels like a you know MP class uh problem into a single um problem. And and so I I it it it actually makes me wonder if you know, our understanding of problems like P equals NP or, you know, these very fundamental like computational hardness problems are

Incomplete, right? Like like it's it's not like, you know obviously this is not uh proof of like P equals MP or anything, but but there's something to it that like kind of is very disturbing where like what felt like a very hard problem can fall to a very, very simple macroscopic simulation.

C

Now there's a very interesting insight that A lot of problems which are proven to be NP hard, like I don't know if Go is proven to be NP hard, but uh okay, protein folding, et cetera. have been like neural networks can solve them because they're NP hard in the worst case but we're not dealing with the worst c we're usually not concerned about the worst case. We're you know like the these problems have a lot of structure to them. Usually

B

The the kind of question we should be asking ourselves is like, we've been formulating you know, solutions to MP hard problems as in the kind of worst case complexity. And I wouldn't say, you know, this solves Go, right? It doesn't give us a exact solution of the optimum. Yeah. But in practice, like it is extremely useful. And the same thing has been shown in like Alpha Tensor, Alpha Fold, where like

Yes, there is a very hard problem that in the worst case seems intractable, and yet we're able to make like almost arbitrary amounts of progress. So so here's a sort of like uh you know, in the limit what is what what what might this look like, right? Well um If you want to simulate something very complex like weather or predict the future, like do we live in a simulation or not? Um, the computing resources you need to build a very complex simulation might be much.

Smaller than you think, based on our ability to amortize a lot of that computation into the forward pass of a single network.

C

Interesting.

B

So to me, yeah, Alpha Gold was the first paper that kind of like really showed this like profound level of, you know, simulation being compressed into a small amount of

C

Um I I feel totally not at all qualified on the uh computational complexity of the math uh to comment on this. But I wonder if um there's an important role of chaos here where If under what is the problem with weather and why does it take ten x the amount of resources to predict weather a day out? Uh and continually so for every m more day out. Is because it's a chaotic system and so small perturbations can totally uh change

the final estimate as as time goes on. And um I guess it's interesting well, I guess you would expect that for Go and protein folding as well.

B

So here's an analogy to weather that that might be relevant in Go. So um the problem of like, you know, here's our current board state.

C

Yeah.

🔇 Silence

B

Given what we know about both players, What is the board state in the future? Yeah. What is the exact board state in the future? This is uh this is extremely sensitive to initial conditions. Like a single stone place here can kind of disrupt the entire prediction. Yeah. So this is hard. This is kind of intuitively the chaotic problem. Um and yet, somehow, So this is this is hard. Somehow we can predict who's gonna win.

C

Yeah.

B

Yeah.

A

And this is the

B

captures a lot of possibilities here. And so that there's this more macroscopic quantity that we really care about, which is the average or expectation or some sort of global macrostructure over a lot of like you know possible future.

C

I don't understand anything about it

B

And so so in in weather it could be the same thing, right? Like we don't exactly care like what the you know velocity of wind six thousand uh feet above a specific latitude-longitude is. We kind of care like

Where's the hurricane? Or, you know, you know, things like that. And I would say like in Chaos, you know, there's a classic like Lorenza tractor, which kind of looks like this, right? Um, yes, you you don't if you start anywhere on the Lorenza tractor, you don't know where you're gonna end up.

But you you do know that the thing looks like this. Yeah, yeah. Right. And and so there's this kind of beauty of like sometimes we don't necessarily care about the micro-scale things. We actually care about the macroscopics.

C

Yeah.

B

And these things can be predictable.

C

Aaron Powell And contrast that, say, to something like a hash function, which is also incredibly inde uh dependent on initial conditions, but doesn't have a macro structure, or at least hopefully if like the on the work.

B

Yes, one of them.

C

And so there's like no equivalent of a value function or like broadly how's the weather gonna be that is i interesting there. It's really just about what is the move, what is a board gonna look like a hundred moves from now exactly.

B

Uh yes, intuitively that seems correct. I I uh and then again this is uh also out of my my area of expertise, but I um I find it interesting that like cryptography has not been able to like um the tools of cryptography and uh you know um hashing have also not been able to prove that like uh you cannot come up with fast approximation. Like you cannot come up with fast approximations. If they were able to do that, then you could prove P is not equal to MP.

C

In fact, we know that there's structure in many cryptographic protocols, obviously like uh uh RSA cryptography. There is structure and that structure is what quantum computers exploit to uh to break them, right? Reiner has a very interesting blog post, which we talked about in the episode, where he uh talks about how if you look at

at a high level what cryptographic protocols look like and what neural networks look like. It's extremely similar where you have sequential layers of jumbling information together. And it's because there's been this conversion devolution in the algorithms where in cryptography you want the final state to be incredibly sensitive to initial conditions so that it can come out sort of looking jumbled based on if you change anything. And then neural networks.

You similarly want everything to be dependent on all the information because you want to process all the information and consider how it relates to itself.

B

Yeah, you have the maximum power of a neural network at the edge of chaos. Um I think there's some like research papers from uh Joshua Scholdik's team on on on this. Yeah. Yeah, like like it's there's something kind of quite fundamental about like Chaos that is it's not just like hopeless noise. It's like there's something kind of

useful, right, in in um in chaotic systems. Yeah. At least a at that boundary. Um but yeah, this is just my like think about this as a philosophy. I I don't I don't actually know the math uh well enough to comment on it. Anyway, if we um go back to um We we'll talk about LMRL in a little bit because there's some connections there. But let's just go back to like the MCTS. Like what is it doing? It is not um crucially it is not saying we're going to uh

Increase the probability of winning directly. It's not gonna say like we're gonna um upweight all actions that won and downweight all actions that didn't win. Um importantly, what it is doing is saying for every action we took. We did a pretty exhaustive search on MCTS to see if we could do better. And we're just gonna make every action that we took better by predicting like having the policy network predict that outcome instead.

And and so this is um a very, very nice idea because you have one supervision target for every single action. So the variance of your your learning signal is very low compared to the alternative naive RL. So let's actually consider what let's consider a very naive algorithm uh uh uh that looks a lot more like you know modern LLMRL today, where where we do something like um let's take the winner of a self-play game and encourage it to do more of that.

Alternative RL approaches

Okay. So uh it's worth kind of thinking a little a little bit about like okay, what are some alternatives that we could do to train self-play agents instead of MCTS, right? Like, you know, we we use a lot of uh LM style RL these days, like is that relevant? Could we do that instead? Um so let's think through this a little bit. Let's suppose we have a very naive algorithm where we take a league of agents of different checkpoints and we play them against each other.

And um for the for the games where a single player wins, uh, we're gonna reinforce those actions up and then and then uh and and train retrain the policy network to imitate those those guys instead of uh instead of the MCTS objective. Um So what ends up happening is let's say you have a chain of actions that led to a win. And you're you're you have a matchup between two agents that are basically the same. So in fact, let's just assume that like uh you know policy A.

And policy B are like evenly matched, right? So their true, their true win rate is is like 50%. So let's say you play a hundred games. And then uh each game, let's say, lasts you know 300 um moves. And um you're doing some sort of like

evolution strategy or some way to perturb these things to get them get them to do different things, or maybe you don't, and then you just play them against each other and you see like occasionally this one might actually have a better strategy than this one. And so so let's say um you know 51 games, um the policy A wins. And then forty nine games, policy B wins.

And this is just due to random luck, or maybe you perturbed policy A in some way that let it do this. And just to have a very, very simple model, let's pretend that for f uh for like uh 49 of the games. They played exactly equally. Um I'm sorry, for for fifty of the games, they they have played exactly equally, right? Um

And on that one game where this one won, it it played slightly differently. It made like one critical move that like, you know, normally it would have done differently, but uh due to some exploration or some random noise, it just happened to make a smarter move than it did previously. So you have one supervision signal, like one true supervision signal for your policy network.

And then you have uh 99 games times 300 moves, for which uh imitating those actions gives you exactly the same policy you had before. And so the um the scale of your variance is actually very bad because it's like you only have one label out of this enormous data set of of actions, of supervision actions, where you want. Actually, sorry, let me let me let me clarify a little bit.

C

Okay, so we're just talking about how the good move The outer distribution move is a small fraction of all the moves that are played across all the games on which you'd want to train. And um this of course reminds me of how LLMs are trained with policy gradient methods. Uh Karpathi was when he was on the podcast called it like sucking supervision through a straw.

A

Um

C

And uh and so yeah, it's interesting that this like this thing you're saying which would be intractable and prevents you from actually getting beyond a certain level in Go is just by default how LLMs are trained, question mark?

B

Right. So um in this case, uh th this is not to say it doesn't work, right? Like if you imagine increasing the number of games to like you know millions of samples, you actually can get some meaningful supervision like s samples so long as you find a way to sort of Mask out the supervision from these guys. And then this is where things start to get pretty related to RL in terms of advantage and baselines and so forth. So so let's um

Let's look at the equ you know the gradient variance of a very naive approach like this, where um I'm just gonna call it like gradient RL. And it's basically the sum of rewards.

🔇 Silence

A

Okay, I see what you're saying.

🔇 Silence

B

So the sum of rewards is the return, right? So so like uh in in our naive setup here, we only have uh indicator variable for the return where either you won or lost. So So in the case where you lost, well you just don't train on your gradient is zero, you don't train on those examples. And when you won, you try to predict those those things. So the you can think about this setup as a as a special case of this general formula here.

The um the trouble here is that this is very high variance because when you multiply these terms out, when you take when you try to compute the variance of this, and so so variance of the gradient. is equal to expectation of Squared minus

And just for simplicity, we can pretend this is like you know on average zero or something. We're c if if we're centering it at at you know no signal. Um And uh the variance here basically means that you're you know taking the square of this product term, and so you end up with uh A term that kind of grows quadratically with the with T. So so uh variance um when you have a setup like this, th this thing acts as a coupling effect on top of of these terms.

Um let's actually map this to an LLM case and we can answer like why do LLMs only do one-step RL instead of a multi-step RL scenario. Um in LLMs you have a decoder that might predict some words like hello world. And so in current LMRL, they treat this entire sequence. action, just A T, and big T is just one, right? And so yes, it is true that you know the uh because of how you know transformers are formulated uh thro through the sort of product of conditional probabilities, um, we do have

Uh you know, probability of this sequence is equal to the sort of sum of log log probability of the whole sequence is equal to the sum of the probabilities of like, you know, uh individual tokens, right? So so in this case I would um I would say something like, you know, log hell plus log low plus log. So um this is true. And if this term were one, then they would be the same thing.

However, um in sampling things, if you have a reward term assigned to every specific token, now you have these interaction effects between the cross-multiplication of these terms and these terms. Right. And so the problem becomes how do you ascribe the credit associated with every episode to all these different terms?

C

I guess the thing I'm confused on is how what would that even look like to do it that way in um

B

And LM.

C

In LLMs because you do you only do get a reward at the end of the episode. So

B

You could imagine a reward that says like, I'm gonna give you some process supervision where you get a reward for each of these actions on every step.

C

Okay, so you're saying it instead of doing it that way where you um But I guess the way you've written it, it would be a sum at the end anyway. So it would they they wouldn't have to be multiplied. But you're saying instead of doing it that way, you would just add up this process rewards at the end and then treat that as one single reward signal?

B

Correct. For one single log product action.

C

But um isn't that how it's written to begin with anyway? Like the sum of the r uh rewards.

B

So the the thing that's a little bit hidden here in the math is that we're assuming that when you decompose the problem to a multi-step problem, that you're now introducing kind of correlations between your actions through the computation of the And uh so you if you separate these things out, then there will be um this this will magnify the variance of of this.

So in in the case where you don't separate it out, if you just have t equals 1, you just have a single estimate of log prop and a single estimate of reward. Now uh there are this this term still shows up in uh LLMs. So in LLMs, it looks a little bit more like the the naive reinforced estimator looks a bit like return of the single action plus uh times, you know.

🔇 Silence

B

It looks kind of like this. This is sort of the very uh basic form here, but this is still a contributor to variance. Um so you want to make sure that like you don't uh similar to how in this case we were training on a lot of neutral labels. You want to make sure that you're subtract you're you're sort of penalizing the labels that don't help and only rewarding the ones that actually make you better. Right. Right. So intuitively, the analogy here is like

Can we find a a term in our training objective such that it's actually kind of discouraged from doing this or, you know, these don't have any effect on the gradient, and this has an effect on the gradient. Right.

C

I I guess if you applied that there, the only thing you could do is eliminate Forty nine of the games. So At least if you the way you have it written there, it would be um fifty one times

B

Actually the optimal case is to pull out discard all of these moves. get a gradient on that single move that you got better. Right. So this is a pretty tricky problem in practice. And so this is where advantage estimation happens in reinforcement learning. So you want to subtract um you know a term from

from your your your multiplier instead of an indicator function of like one and zero, you want something that kind of behaves like a zero for all of these guys. Yep. And then a one for all these ones.

C

Yeah. But you so you could do that if they're um If you can say, Hey, I won this game, so th this is slightly above baseline performance.

B

Well you won on a lot of games.

C

Exactly.

B

But you don't know which ones let you win because they were truly better versus winning on accident.

C

How would you design a baseline where it's truly better?

B

Yeah, so this is where in RL people use things like TD learning to better approximate the quality function, the queue that we mentioned earlier. So you can try to subtract that from your um your your your return. So ideally what you really want to do is in RL, you want to

A

Um

B

push up the actions that make you better than the average and um push down the actions that make you worse than the average. And they call this advantage. There are multiple ways to compute it. I highly recommend John Schulman's uh general advantage estimation paper as like a

good you know treatment on how to um how to like think about various ways to compute it. But the at the end of the day, you know, um you you want to reduce variance by p trying to make this smaller and so that it doesn't magnify the variance of the

C

Make sense. So but um this requires you to have a very good estimate of what average performance from a statement would look like. And this is this gets back uh gets us back to the value function thing we're talking about earlier.

B

Right. And so this uh keep in mind that in this case, this model free RL setting uh is trying to solve a credit assignment problem where you don't know which actions were actually good and which ones were bad. Monte Carlo TreeSearch is doing something very fundamentally different, which is it's not trying to do credit assignment on wins. It's trying to improve the uh the label for any given action you take.

And so we can actually think about a completely different algorithm called neural fictitious self-play, which was used uh to great effect in uh systems like AlphaStar and um and the OpenAI's Dota button. Um so let me talk a little bit about how Um how you can kind of unify some of these RL ideas in the model-free setting as well as the self-play setting. Okay.

What happens if you don't have the ability to easily search a tree? Like in Go, it's a perfectly observable game. You can easily construct a pretty deep tree that completely captures the game state. In a game like StarCraft where you don't have really complete control over the binary, it's it's a little bit hard to do this. And I'm not even sure if it's a it's a deterministic game, right? So so that makes this uh kind of difficult from a data structures perspective.

So um what is done instead is that the the basic idea of Supervising your actions with a better teacher is still there. So if if in a given neurofictitious, so we're gonna talk a little bit about how neurofictitious self-play works.

🔇 Silence

B

Same idea, we're gonna like come up with better labels for each of the actions we took, just like in MCTS. But how do we derive the better labels?

🔇 Silence

B

In MCTS, we uh perform search to and assuming we have a good value function, the search will kind of give us a better result than our initial. In a game where you can't easily simulate a search process, what they do instead is train what is known as a best response policy.

🔇 Silence

B

So you fix your opponent. So let's say you you're trying currently training pi A against. Um a strong opponent, Pi B. In StarCraft, maybe like you know, those are the Zergs and you're playing Protost. So you fix your opponent and you treat this as a classic model-free RL algorithm where your goal is just to beat this guy.

And so here you use your standard TD learning style tricks, or use PPO, or any actually like uh you know model-free RL algorithm to try to hill climb against winning this player. And so um you train you train uh basically you you you have a reward function that's like, you know, return is like, you know. One if wins. Again.

I B. So this is no longer a self-play kind of problem, right? This is just like a fixed opponent and you're just um solving it trying to maximize um a score against against that. And then you know, zero otherwise. And so you're you you have a sort of fixed environment where all you care about is just beating the And um

Once you have a good policy that you trained with, uh, you know, pick your favorite model free R algorithm, PPO or SAC or you know, any kind of mixture of the or you know uh VMPO or whatever. Um You now have a good policy that gives you a good label for what this one should do when playing against that player.

And when you train multiple best response policies, you can basically then distill the RL algorithms into the labels for a given opponent. So you might have, let's say, a best response policy against PyB. And then maybe you have a collect a league of you know um of opponents like pi b, pi c, pi d. And you're gonna take the best response policy that you train against each of these fixed opponents, and for this one, you're going to uh supervise them with the label that this one would provide.

So it is kind of like this is almost like a proxy for your m MCTS teacher, right? Instead of MCS teacher, you use a model-free RL algorithm to find the best search action that you could do to uh to to kind of beat your opponent. And then you're finally you're distilling the um the policy here into s uh what is known as like a uh a mixed strategy, where it's trying to basically average across all possible opponents it could play against.

And this is what gives you something that can do no bet no worse than like you know an average league average selected opponent from the league. And and so this gets around the problem of having to derive a teaching signal from MCTS, but it still fundamentally is about relabeling your your your states with better actions so that they improve your

C

And just make sure you understand this is like If you win against against win again against this other policy, you sort of reinforce all the actions uh on that trajectory.

B

Yeah, so so here you can use a number of algorithms like PPO, VMPO, um you know Q learning, even if you want. Uh the specific algorithm here um can be you know it's usually a model-free thing because you don't have search. But it it there's an interesting connection from MCTS and QLearning that I I wanna you know bring up. So in MCTS, you do something where you have a tree.

And through the resolution of your your your value function at the at the leaves of the tree, or you know, your approximate leaves of the tree, you can kind of back up. through through the you know the the sequence of many sequences and then obtain some sort of mean value estimate. Your Q is kinda is kind of derived from the average of a bunch of simulations. In model-free algorithms, There is often a component of estimating a Q value.

And so um and Q Q values are often learned through TD learning, although in PPO the the way that they do advantage estimation is not necessarily through a Bellman backup. But um but in Q learning there's this kind of uh very cool trick where um You do You know, Q S A is backed up as R plus you know some discount factor times the max. So intuitively, how this works is like if you have an MDP.

🔇 Silence

B

And then this is like you know terminal.

🔇 Silence

B

What this is sort of saying is that like the best action you can take at this uh state. Equal to the reward you take for you know taking this action plus the best that you can do at the next state. So there's a sort of recursive and dynamic programming property of MDPs. And you can train neural networks to basically try to enforce this.

This const this uh consistency. So you can say like, well, once I know the Q value of this action, I can then use that to kind of compute something about the Q value.

C

So when earlier I was like, hey, why are we training policy? Why don't we just train the value alone? That is what this is.

B

Um this is a algorithm for recovering value estimates of intermediate steps when you don't have the ability to do forward search. So you must collect a trajectory first of like n steps before you're able to do this trajectory. Um but the intuition is kind of the same, which is that like knowing something about the Q value here.

can tell you something about the Q value here, and indeed you can recover a policy from a Q value. So the um you don't need to explicitly model the policy distribution. You can actually recover the policy distribution by doing argmax over your um your Q values. So QF uh Q-learning or um you know this kind of like uh approximate dynamic programming kind of propagates what you know about the future cues backward like this.

Right. And you can see that there's a sort of similar structure that goes on here where uh in in this case, you you're planning over trajectories your agent hasn't actually been to yet, whereas in this case you're planning over trajectories your your agent has visited. Importantly, why does Q learning, you know, why was Q learning a big deal, right? Like it's because

Historically, we just haven't had the ability to do search on fairly high dimensional problems like robotics or whatever. So for a long time we kind of make the assumption that like, okay, well if we can't model the dynamics with like a world model or something. We're gonna instead just collect trajectories and then plan with respect to the only number that really matters, which is reward.

Why doesn't MCTS work for LLMs

C

Okay, so this is very interesting. And then to unify this with our discussion of LLMs. So with LLMs, you're doing something You don't have Q values, but you're doing this sort of backwards learning where, hey, let's find the trajectories which pass some unit test in some coding environment. And then let's reinforce those trajectories. And then there's a huge difference between that.

And this forward approach with MCTS. And the reason you can do MCTS and i it's much more preferable to do MCTS because you can do it per move uh and make each move better rather than having to learn per trajectory.

Um and hope, you know, as Karpathi said, hope to learn the site. Yes, you get the supervision through a straw. Uh basically just upgrade all the tokens ru in a trajectory that might or might not have been relevant to getting the answer right. The reason you can do this much more sort of sample efficient uh much more favorable thing with Go is that because MCTS works in Go, you basically know that hey, I if I just do search locally here.

And this search is sort of truncated at the end by this value function that uh works even even if I haven't unfolded my whole trajectory. I can just say this is my new policy. Um, and I can improve in a more iterative, like local way rather than having to. Ha having to unfold all these trajectories

B

So there was some research, I think, from Google in twenty thirty twenty twenty three, twenty twenty four, where they did try to apply tree structures to reasoning. And I think it's you know, the jury's still out as to whether this can ever work. So I I would say like It we probably will see like r you know, revisiting of this idea of forward search um in in the future.

But there's two things that make MCTS very simple for Go, which is that value estimation is kind of concrete and you can determine it for real, and then you can kind of uh uh sort of use it to truncate depth, as you said. And then the uh breadth is also uh determined. And what's kind of critical is that the action selection algorithm where you iteratively visit and grow the tree is um

well suited for the size of problem that Go is and the depth of the problem. But for something like LLM reasoning, um, you know, pucked might actually not be a good enough heuristic. It might be too greedy with local tokens and it might do something like, oh, only give you, you know, uh sort of obvious thoughts that are correct, but not really solve your final problem. So I I would say the jury is probably still out on how like

what the final instantiation of reasoning for LLMs would look like. And I wouldn't rule out that like this stuff could, you know, come back. But it's a bit hard.

C

D don't LM sort of night natively learn to do MCTS where they'll try an approach and be like, Oh, that doesn't work. Let's back up. Let's try this other thing and then go in the direction that proves to be more fruitful.

B

Uh yeah, cer certainly I think the LLMs manage to do something that looks like real human reasoning without having to do an explicit tree structure. That being said, I think the idea of doing forward search and simulation to get a better sense of what is valuable might make a comeback, um, even though not exactly in the same instantiation as as uh Alpha.

C

But uh uh just to make sure I understand the crux of it, like the the breadth from the number of legal actions being wider and the depth from being able to not being able to train a value function as easily because

B

So here's an example where LMs break down. The C pucked rule involves, you know, square root of n over one plus n a. In an LLM, like you're most likely never gonna sample the same uh child more than once. So if you have let's say multi-steps of thinking, um because language is so broad and open-ended, it's uh a sort of

discrete set of actions is not really an appropriate choice for an LLM. Even though they're discrete tokens. It's just such a large number that this type of exploration heuristic is probably not the right thing to do to guide how to search down a tree.

C

Right. But I I guess the crux comes down to the fact Yeah. In Go, you know that the MCTS is almost certainly better than your current policy, even though you haven't gotten even though you haven't explored the end of any trajectory. Correct. And then in um in normal reasoning for LLMs or robotics, Th there's no way to just locally evaluate and improve your next move in a way that doesn't result in i in a way that's independent of actually like solving the problem.

B

Uh No way is a strong word. I think lots of people have thought about how to try to apply MCTS or its kind of successors like Mu Zero to continuous control spaces. And I'm sure you know very cool research work is still ongoing to try to crack that problem. Um but yes, the the s seeming challenge right now is that like Most problems in much higher dimensional action spaces or something that's combinatorily much bigger, like language.

They they don't seem as amenable to the kind of discrete action selection heuristics as well as uh kind of game evaluation type stuff that uh Go does. Um but that's not to say the idea of like, you know, thinking into the future along multiple parallel tracks might not give you some information about like which way to search. Right. Like if you think about mathematics.

I think mathematics often occupies a a little bit more of like a logical search kind of procedure where you kind of can back up, you can kind of see like which paths seem good or not. There's a more of a a rigid structure there, whereas maybe like in a uh you know, business negotiation or something, um it's less of a tree and maybe, you know, something a bit different.

C

Okay, so we're now seated so I can ask you some more questions about AfleGo and about EI research more generally. Um In twenty twenty one, Andy Jones had a paper called Scaling, Scaling Loss Reward Games. And um he basically anticipated inference compute or inference scaling by showing that you can trade off test time compute and uh training compute. That is to say that you can

spend more compute on the forward the searching through the MCTS. Um and if you do that you can get the equivalent performance as having

spent more time training the model. Um and so if you int you know you s if you see this pattern you might think, okay, well with LLMs you might do something like that in the future. And in fact that's what had ended up happening. Okay, so what is a kind of Fun exploration one could do now to explore other axes of scaling in toy settings which will be important to understanding what AI development might be like in a few years.

B

Sure, yeah. Um I think that indeed test time scaling and uh reasoning and how it interacts with model size are quite profound when it comes to like how much uh needs to be actually done as explicit search versus how much can be packed into the forward pass of a neural network. Right. And

And um how does a four passive neural network sort of learn how to do something that should be a sort of sequential and you know recursive step? That's quite interesting. Yeah. Um so the yeah, the Andy Jones scaling laws for board games paper is quite cool.

There's another really nice result from that paper where they sh uh where he showed that um not only can you predict scaling loss of like you know the sort of LLM variety where um as you increase parameters you can decrease the amount of compute for search or vice versa. Um

He also showed that you can actually predict uh how much compute is needed to solve a uh larger version of the board game, uh for example. And and so with Go, you know, which can scale from, you know uh three by three to uh infinitely sized, you know, uh go board, you might actually be able to sort of revisit this question and try to reproduce uh whether this shows up.

Uh you know, I actually started this project with this sort of a motivation that like, does the bitter lesson or does our knowledge of scaling laws allow us to kind of

execute a lot better on a sort of compute optimal go bot and can we can we kind of build a strong go bot without all the katago tricks, right? Just just by really focusing on the bitter lesson and scaling loss. I have not been successful so far, but I think it's it's sort of a a fact that like Usually when you want skilling loss to work

You want to be in the regime where um the the recipe already works and the data sets are good rather than trying to kind of figure out how to do scaling while also trying to figure out what the the right data set are. Okay. So so this is um like the scientific understanding component in research often follows a step where you get something to work first and then you use that um

system to collect data that then helps you build a mental model of how things work, such as scaling laws. And so usually actually if you want to build a strong GoBot using scaling laws, you actually have to make a strong GoBot first and then use the scaling laws to kind of extrapolate a bit farther into the way.

C

Say more, so uh uh d just so I understand, first of all, you're saying scaling loss did not work or you could not there was no scaling loss pattern that you could uh see in your robot.

B

Yeah, so um a mistake I made initially when I had some bugs around how MCTS labeling was working was I would um I would collect a bunch of data with an expert policy and then treat it as a supervised learning problem and try to identify scaling laws with uh expert data.

You can indeed plot things that look kind of like this, but if you're in a regime where, you know, your policy's not working well, you might be just studying scaling laws on like bad data. Right. So so just like one important implementation detail is that if you wanna

study a scaling laws problem. You kind of have to have a problem for which like the data is good, the architecture is good, and there's no bugs, and then like you you you you solve it there. Um ex ante, I wasn't able to like apply scaling laws to direct what what to look up look look at.

um until you know I had re the rest of the system working. And and this sounds obvious. Like to to researchers, of course you want to have like a working bug-free system before you study scaling. But uh just just as a sort of advice for practitioners on like where I actually tripped up when I started this project was You don't necessarily want to kind of jump into the science of studying your man made artifact before your man made artifact is like interesting enough to be

C

Speaking of compute, so if you you can look at these charts of compute used to train the best AI model in the world over time going back 10 years. And it's a very smooth line in log space. uh that is it exponentially growing year over year. Uh except there's this huge aberration, and that aberration is Affligo Zero, uh which is trained on way more compute than any other AI model at the time. It was like uh three E23 flops.

I mean that's sort of comparable to like a Frontier LLM. I mean orders of magnitude off, but still. Um And so yeah, the question is, especially with you being able to get something off and did you train it on your own

B

Uh I got a donation from Prime Intellect for like about ten K and then I spent Um I spent maybe the first four K doing um kind of exploratory research. Yeah. And then uh about three K on the kind of final run. Yeah. Um and then uh some some of it remaining for serving the model for

C

Yeah, is your sense that they were just did a bad job trading it if you can do it in ten K now? Uh

B

The compute required to be the first to do something is always like much larger than the compute it takes to catch up. And it's the same story playing out in LMs, right? Like once someone else has done it. Um you could use tricks like distillation, you could use um uh all sorts of like kind of uh crutches to kind of bootstrap your way to success. So with my own bot that I'm I've ho hosted online.

Um I actually used uh sort of best response training against the Codigo models to kind of get a strong level performance. And um, you know, as as a time of recording, I'm I'm validating whether this can be uh I can kind of do that first step, which is to do the tabula rasa play. Um but importantly for research you often want to start from a good init, right? So so the kind of simple thing I did first was train best response agents against

Caught a go. AlphaZero team, uh, they did not have any policy that they could train against, right? Because they were trying to do everything tabla rasa. So um and being the first to do it means that you're prioritizing getting the thing working rather than like, let's say, the most compute efficient uh possible implementation. So um this actually plays out in robotics as well. Like if you look at the kind of frontier of large models trained for robotics.

Um there are the scatter plot is all over the place and there isn't a very clean line the way that there is for Frontier LMs. And and that is because the folks training these models often are not At the scale where every flop counts, and they need to like kind of squeeze out the performance of every single flop as the dominating decision, deciding factor in pre-training.

Right. Instead their focus is more like we want a certain capability to to show up, so we optimize the uh training setup to kind of make it easy to derive that capability. And once you have that capability, well Um invariably if you scale up the compute, you are forced to kind of make it compute efficient because this is like hundreds of millions of dollars we're talking about.

A

But um

B

But in the past, when when compute for experiments was kind of more plentiful or you know not not uh not um uh accounted in a way that the researcher was really responsible for, then you kind of end up with people optimizing for things besides kind of being on the compute optimal predo from two.

C

I see. Like speed or something.

B

Yeah, like time to result or just getting it to work. I think the first AlphaGo, like probably they had lots of compute and they didn't need to be um they didn't need to worry too much about making it the most compute optimal.

C

And how much of the improvements to compute efficiency are methods that did not exist as of twenty seventeen versus things which you they could have done in twenty seventeen, but um

B

Yeah, great question. So so going into this project, I kind of knew in the back of my mind that like Things always get easier to do over time and I wanna see like where where is Go at, given that like it didn't seem like there has been any major open source, you know, uh strong bot bis after katago in twenty twenty. Uh and then, you know, reading the katago paper, there's a lot of clever ideas. I was kind of wondering like, okay.

Uh let's look let's see if the bitter lesson has happened where like a lot of these kind of tricks just sort of go away because the NVIDIA made faster GPUs, right? And and so uh roughly where are we on that? So um again, this is not a peer-reviewed claim, so uh this is just my preliminary um, you know. vibe guess on like what I've seen based on my own experiments. But it seems like um you know architecture choices don't matter that much. You know, Transformer versus ResNet.

We're we're at the sort of speed of GPU where the size of the model is not so big that this really matters. Um Um you can actually simplify the setup quite a lot. So instead of doing a distributed asynchronous RL setup with replay buffers and pushers and collectors, you can kind of do a a a dumb synchronous thing where you like collect. You just train a supervised learning model and then you collect again. And and so there there's like opportunities to simplify infrastructure.

Um Nvidia GPUs have indeed got faster. So whereas Katogo was trained on V100s, you can train on like half the number of desktop Blackwell GPUs, and it still works.

A

And

B

Um some of the kind of auxiliary supervision objectives that Katago developed aren't really necessary if you have a strong initialization. Right. So so if you're initializing against, you know, best uh response training against Katogo itself. then your own model actually needs none of the tricks that Katago needs. Yeah, yeah. So so then the core thing is like, how can you get as quickly as possible to some strong opponents? And that matters a lot more than the specific architectural innovations.

But there are still some c nice compute multipliers. So I found that training on nine by nine boards was very nice for resolving endgame value functions. And then like if you can co-train that on a architecture that can transfer between nine by nine and nineteen by nineteen, then you can really cut down the warm start time to learn that from I think uh AlphaGo Zero, their plot was um first 30 hours or so are spent basically catching up to the supervised learning baseline. Um

You can cut down that time a lot by kind of pre-training on a small board and then and then like you know worm starting that into your you know 19 by 19 board play. Uh There was some other stuff like, you know, varying the number of sims between episodes. This turns out to be not that sensitive actually. Like you can kind of, you know, fix it or increase it. Doesn't matter too much.

Um but so anyway, it's kind of just nice from a scientific perspective just revisiting like an old paper and seeing like what really matters.

Off-policy training

C

This is sort of tangential question. Why is it okay to have a buffer in Afligo?'Cause every time I talk to any researcher, they're telling me about how bad it is to be off policy. But then the way a naive implementation of Algo Zero would work is that most of the moves in a given backward step or in a batch of backward steps. um would be uh not not among the ones that were made by the most recently trained model. So why is that okay?

B

Great question. Yeah. And this this gets into the sort of fundamental off-policy versus on-policy reinforcement learning kind of question. So uh as you recall in MCTS, you take actions that you took. And you relabel them to uh to take different actions on the same states. Right. So the off-policy part here comes where um what if you're relabeling states that your new policy would never visit?

Like what's the point? You're kind of wasting capacity. And in the extreme limit, imagine your distribution of states in your training buffer are all states that you would never visit. Then you're basically supervising them to uh take

Good actions on states you would never achieve, and therefore your policy can get really bad. So this is where off policy can really hurt um uh uh alpha go. Um However, if you interpret this sort of from like the dagger perspective, which is basically saying like a way to kind of correct yourself back to the optimal trajectory uh given some some data, what what you kind of want in an algorithm like this is to have mostly states that you would visit.

But then you have a small percentage or maybe a reasonable percentage of states in this kind of high-dimensional tube around your optimal tr trajectories. And any of those states are given a supervision target to kind of f uh sort of funnel you back into your optimal trajectory. Uh so maybe I can just draw real quickly here. Great. So in

Um sort of a dagger style setup, what your kind of optimal training data distribution is, is that here is your optimal states and actions. So this is like, you know, you want to be in this state, you want to be in this state, you want to be in this state, and then you win here.

And then these are your optimal policy actions. So these are the things that you definitely want to train on. But to make it robust to disturbances, you want to make sure that if you happen to drift off into some other states, you can kind of funnel yourself back into

C

So why isn't this a fully general argument for op policy training?

B

This is actually why you want to do off-policy training sometimes. You you don't want to have a compounding error where if you make a mistake, you don't have the data of how to return back to your optimal distribution. And so um optimal control does not really say too much about like uh you know how to

uh you know, not accidentally get here because it's sort of making the assumption that like once you learn the policy you're gonna get here. But in applications like robotics, right, like like I don't know, uh a gust of wind blows you slightly off and then now you need to like correct, right?

Um or the friction on one of your tires is kind of a little bit like lower than the other wheel and then now now your your car's drifting and you gotta gotta like correct it. So so these kind of things in in like a more real environments often happen where like um actually there's a funny uh quote about

chess and also go is like the problem with uh the the problem with go and chess is that the other player is always trying to do some shit. Right. Like uh so so like you know things can kind of drift off. Yeah. And you always want to be able to correct uh back to your back to your winning condition.

So your replay buffer really should have like your, you know, the states that your policy would visit, plus some s distribution of states that you might drift to, and then how to return back to your optimal states.

C

Yeah.

B

Now, if you take this to the extreme and you say like, well, let's uh we don't have any of this data.

🔇 Silence

B

And we're gonna just like be labeling with MCTS um, you know, states that are so far away from our optimal behavior. Like this this bag of states over here. Well like now, yeah, I mean like each of them gets uh MCTS label and your policy learns how to do take sort of the best possible action here, but you never get here. So like you're training your model on states you would never reach.

uh like like this is this is not there. So then this is a problem. And and this is where off policy can really hurt. Yeah. Um so actually I tr uh as as part of this project I did try an experiment where I took a bunch of trajectories. And to try to saturate the GPU as much as possible, what I did was I took uh you know random states from the data set.

And uh re-ran MCTS on just those states. So instead of playing a whole game where I'm doing MCTS on every move, I just ignore the sort of causality of moves and just pick random board states. And I just label those with my current network. And uh and I might revisit old states that I've labeled before and relabel them again with my current network. And so in practice, this actually does work. You can actually say like, let's take some states that are reasonable.

and constantly be relabeling them um in uh in um while we're training. And and so this actually starts to converge on a very robotics-like setup, which is very common, which is you have your your data set of trajectories. Um And then you have something like a replay buffer pusher.

🔇 Silence

B

And these are off-policy offline trajectories, right? So your replay buffer pusher pushes transition tuples.

🔇 Silence

B

And then you have some job that's kind of continuously um replanning. what the best action you should have done instead of taking this action is. And so in robotics it's actually very common to use a a uh the sort of minimize TD error, so like your Bellman updater. Um constantly is pulling things from here and trying to satisfy, you know, the QSA.

🔇 Silence

B

So so um and then and then from here you have your trainer, which is trying to fit the S to A. Or or s or uh um fit the you know Q to the Q target. So so here you can think about this as a sort of planner, right? You vi revisit old um states that you've been to and you take your current model and you rethink like what could I have done better if I visited.

And um and so this is actually how like kind of off-policy robotic learning systems are usually trained. Um these days there's a sort of simpler recipe, but but like you know in the Google Qt op days we kind of did did things like this.

C

So what is the trainer?

B

Yeah, the trainer is uh you try to you try to minimize uh q SA and Q target.

C

which i can explain the whole setup again like at the high level

B

So you have your off policy data that came from various policies. You're constantly pushing uh transitions that you saw before to a replay buffer. And then you've got this thing called a a Bellman updater, which basically replans instead of this action, what action should I have taken at S? to have a better you know value. And and the way you enforce that is you try to minimize the TD error. So so actually you you given this, you have S prime, right? You you compute Q of S prime.

And you find the action that should go with S prime that makes this Q value as high as possible. And then you add that to the reward here, and that gives you your actual target, right? So for this current S and A, your Q target is this. So now you have a now now you send back the q-target to to this this transition. So you so with this tuple you pair with that a uh a q-target.

And then here on the trainer, you simply just use supervised learning and you minimize your current network's QSA with its target.

C

Okay. So the in the background you're just like Hey, let me let me basically think through how valuable were all these actions actually.

B

Yeah. Uh in a in a more optimal policy where you're trying to maximize this, what is the coup target of this transition?

C

Sort of like basically daydreaming.

B

Exactly. Yeah. You can think about it it's like you're kind of going back in hindsight and being like hmm like like given what I've seen in the h historical buffer, um like was there a better action I could have taken? Now the the connection to Go here that I I I tried and it was you know moderately successful but too complex to kind of like open source was um you replace this with like a MCTS relabel.

Um instead of doing this kind of target network computation, you uh run MCTS on your transition. Right. So in in this case, you have uh your state, your action, and then whether you want or not at the game. Um and actually you can just toss these two. You don't you don't care about these. You just take your state and you just plan MCTS to get your best policy, you know, pi. Uh on your current network, right? Not not the network that took this action, but your current best

And um, if these are transitions that your policy can get to, then this actually acts as a very nice stabilizing effect. And also, the one other benefit is that you can like kind of fully saturate your GPU better because you're not like blocking on the Go game to kind of like give you board states. You just simply search across all board states at any depth in peril.

So uh and then here the trainer would be just you know predict the MCTS label as possible. So so uh again, like this kind of works and this is quite relevant in robotics where you're really you just have a lot of offline data and you can't simulate things like MCTS. But um in practice, like it does run into the problem where, you know, like if the current model is looking at states that it would never reach, then it's kind of wasting capacity. And so you you have to be a little bit careful here.

So um the on-policy thing and also much of RL has kind of converged to a much more on-policy setup where they don't really try to like directly train on off-policy data. At best, they use off-policy data as a way to reduce variance, but not directly influence the objective.

C

I'm sorry, why have they conversed to that?

B

It's just more stable. So so like you you might use the off policy Q as a way to do like you know advantage computation. Um like uh you know Q. minus like sum of Q. No. That's kind of like your your or sorry, like you know, sum of uh like if there's n actions and then yeah, so so like uh So like this is your value, and then this is your your kind of current Q value. So your advantage for that action is like the average value minus your current one.

People can try to estimate Q in an off-policy way and then like just use advantage here and then and then the the sort of if there's a problem in these dynamics, the it doesn't like blow up your loss as much. And so in robotics there's a kind of convergence towards more like a using off policy data to just shape your rewards but not actually be directly your

RL is even more information inefficient than you thought

C

I'm reminded now of our earlier conversation of the first time. why MCTS is so favorable as compared to the kind of r you know, reinforce a policy grading kind of thing L LLMs do. And this might be totally wrong, but I wrote a blog post a few months ago about um how RL, at least policy gradient RL, is even more uh inefficient than you might think. And so the inefficiency one thinks about naively is the fact that you have to roll out a whole trajectory.

in order to get any learning signal at all at all. And so as these trajectories become longer and longer, as an agent has to, instead of just previously like complete the next word in the sentence, it has to go instead to, hey, sp do two days worth of work to figure out even if you even did this project correctly. The amount of information per uh flop has been decreasing.

As you had to unroll two days worth of thinking in order to see if you even did something correctly to like did I f implement this feature, the amount of samples per flop has been decreasing. But so you can think of um Uh you're trying to maximize as you're learning bits per flop, right? Um And this is you can think of bits of it per flop as um samples per flop. times um uh bits per sample.

And what I just mentioned uh a second ago is that the samples per flop go down as RL becomes more and more along horizon. But um at least this kind of naive RL is also terrible from a bits per sample perspective. And here's what I mean, at least compared to supervised learning. So early on in training, let's say you have a

uh vocabulary size for an LLM that is 100 K long. So th there's a hundred K possible you know tokens that one could answer. And you have a totally untrained model and you have a prompt like The sky is um With supervised learning, you what what would happen is that the model would have some probability distribution over all the things it could say.

Um there's a label that says actually the term here is blue. And it would figure it would learn basically through cross-entropy loss exactly how far its distribution is from correctly saying blue. Now if you're doing this through RL, um You would say the model would try the sky is Halicon. Nope, that's wrong. The sky is told. Nope, that's wrong. This is a totally untrained model, right? And so you would have to do this.

uh on the order of a hundred thousand times in order to just stumble on blue, then get some learning signal off of that. So if you're in the supervised learning regime and you just get you have your distribution of probabilities, you get told uh that it's blue and you figure out how far off you were, the amount you learn is

um is a function of your pass rate. So like the further away you are from blue, the more you've learned to go towards blue uh using cross-entropy loss. And so you can think of it as like your pass rate, your like prior probability of having said blue. And um as a function of that, like In supervised learning, uh, through cost entropy loss, you would you would learn negative log p, p being pass rate uh bits.

Once you get this label. Whereas in RL If you're just randomly guessing shit and seeing if it works or not, that's um that's just basically going to be the entropy of a binary random variable, which is

B

And what's also tough here is that actually the distribution that you're sampling under is your policy's distribution. So it's like if your policy has no chance of sampling blue, then you will never get a signal.

C

Exactly, right, right. So that's that's being modeled by the fact that Your probability of sampling blue is extremely low. If you do sample it, you do learn as much as you would have learned in a supervised learning.

In all other cases, like, you know, ninety nine point nine ninety nine percent of in an untrained model, you're um you're just learning incredibly little from like seeing Halicon is not the correct word or told is not the correct word. Yep. Um and that's what happens most of the time. So you're just like Um learn very little. So if you try to graph, um if you put on the x-axis your pass rate, um And uh here you put the like sort of the bits you bits you're learning from a sample.

If you have like zero percent here, fifty percent here. and a hundred percent here. So the end of trading you're here. Um if you have um Supervised learning, negative log password would look something like this. And then the uh entropy binary random variable would look like this. Um and this is uh depending on whether you're doing knots or bits.

Oh yeah, if you do bits, it's like one, right? Here at the at the peak. Um this is like a coin flip. You learn the most from a coin flip. Uh this is supervised learning. This is RL.

A

However,

C

The problem is you spend most of training in this regime, right? Like in the in the low pass rate regime. And um In fact, if how fast you're learning is a function of how many bits per sample you're getting, uh and you're getting very uh little signal here. If you chart the pass rate on a log

scale. So you put the x-axis on a log scale where like at the beginning of training with a vocap size of hundred K, the pass rate is one hundred one over one hundred thousand, then one over ten thousand, one over one thousand. Uh one over one hundred and then f um okay, what this graph looks like here where supervised learning would look like this. And And then R L, if you just basically cr crunch what I just sh uh showed there.

B

Yeah. And arguably you spend all your time here. Potentially e never even getting a single success. Exactly. Yeah. Um, once you're here, you have something. But like you actually in many RL problems spend All the time here. Uh so th there's a sort of question of like how do you initialize so you're at least not at zero, but like at a non-zero. Yeah. One more thing I'd like to add about bits per sample that's very relevant to um uh you know you know any kind of machine learning problem is that um

And there's a connection to soft targets and distillation where if you have access to the logits, right, not just the one hot like this this is the sort of one hot uh token answer. Yeah. Um if you have access to the soft targets. um the entropy of this distribution is far, far higher than than the one hot. So there's actually way more uh there's way more information in bit and bits per sample.

Um in a soft label. So that's why distillation is so effective per sample, is that it's actually giving you way more information per sample.

C

Yeah. Well I wonder what the equation would be. But obviously it's

B

So the entropy of this is zero. Yeah. The entropy of this is like, you know, the entropy equation. And this is also why like you know AlphaGo is quite beautiful. In AlphaGo, you don't train the policy network to imitate the MCTS action, you train it to imitate the MCTS distribution.

Both of these are actually valid. And if you wanted to do a scientific experiment of like how important are this kind of soft label s dark knowledge distillation, you can run an experiment where you you uh retrain the policy network on the action MCTS selected rather than the software.

C

Interesting. Uh earlier I was sort of stumbling around. Th this intuitively, why is this ability to do um iterative search where you don't necessarily need to be able to win the game in the beginning. You just need to be able to improve your current policy. Why is that so powerful a capability in learning as compared to how LLNs currently run our learn RL? And um and yeah, it's exactly this thing of uh this is considering your pass rate of the entire trajectory.

I s actually don't know a formal way to think about this. Maybe you should help me out here.

B

W why is Alpha Go an elegant RL algorithm? The major reason is that you never have to

initialize at a zero percent success rate and solve the exploration problem of how to get a non-zero success rate. And and this is what allows you to hill climb this beautiful supervised learning signal where and if you look at the actual implementation of AlphaGo, um every step of the way, there's no f uh there's actually no um, you know, TD error learning or dynamic programming, at least explicitly, it's just supervised learning on a value classification as well as a policy KL minimization.

So it's just a superv supervised learning problem on improved labels. And so the training is very stable, right? You can train like as big of a network as you want, you can kind of retrain this on the data set. Everything will just go stably. The infrastructure is very simple to implement as well. You don't need a complex distributed system to kind of keep everything on policy. At the end of the day, you're just saying like

I have some improved labels, let's retrain my supervised model on these targets. And and so you're you're always in this beautiful regime where you're just trying to improve the policy. rather than uh escape this kind of like uh sort of local minima where everything every signal's flat all around you. Um so so one way to draw the the curve is like if you draw the sort of win rate of an MCTS policy versus the raw network, um let's say this dotted line is the raw network.

The MCTS policy kind of looks like

A

Like this?

B

And so every step of the way, this supervision signal is very clean. Right. You're never in a situation where the MCTS is kind of like giving you no signal unless your MCTS distribution converges to exactly what your policy network predicts.

C

Yeah yeah yeah.

A

Um

C

Cool. Okay.

Automated AI researchers

B

Sounds good.

C

One thing I really wanted to talk to you about is that you did a bunch of the research for this project. through this kind of automated uh LLM coding assistant loop. And um

There's an idea that if you fully automated AI research, you could have some sort of singularity. Uh obviously we're not there yet, but to the extent that we have early indications of what this process might look like. I am curious what your observations about um what the AI is good at, what it's not good at, what you think about this scenario, it's likelihood eventually What thoughts do you have about this in general?

B

For sure, yeah. I think automated scientific research is one of the most exciting um uh Skills that the Frontier Labs are developing right now, and I think it's important for everyone who's doing any kind of research to get a good intuition of like what it can do now and what it can't, and how might the sort of science process work in the future once we're having AI is automating a lot of this investigation.

Um so uh in brief I mostly use Opus four point six and four point seven throughout the uh working on this. And um what works is that the Models can do a very good job of doing hyperparameter optimization. So in the past, people would kind of come up with a search base of hyperparameters like learning rate and

you know, weight decay and maybe how many layers are in your network. And um they would just kind of do a grid search or a sort of Bayesian hyperparameter optimization uh approach. And then it would find some tuned parameters. Um the kind of really cool thing that automated uh you know

uh coding can do now is that it can search a much more open-ended set of problems. It can say like, well um I've identified that like the gradients are kind of small in this layer, so let me change it up here. Let me rewrite the code so the data later ha data loader has a new augmentation I came up with.

let's uh let's uh sort of try to find the the best way to kind of fit the constraints of the optimization problem. And and you end up with this much more flexible and kind of high level, almost like grad student like uh ability to just, you know grind a performance metric. And and so this can squeeze out quite a lot of performance. You can, you know, f on a fixed data set with a fixed time budget um improve perplexity by quite a lot on on a sort of classification problem like LMs or um or Go.

And uh it is also fantastic now at basically executing any experiment. So I have a claude skill that I wrote called experiment, where um I give it a description of what I wanted to plot. And like I just describe here's the x-axis I want, here's the y-axis. answer this question for me. And it'll go run off and do all the experiments, compile the plot, make a report, and suggest like, you know, what might have caused it or or so forth.

So that's what works quite well today, and I think we can expect that these abilities get better in the future. But it's also kind of useful to know, you know, what what is it not doing so well today? Um so on my uh blog version of this tutorial, I have a a plot of basically all the kind of experiments I did grouped in a sort of tree.

Where every node kind of represents a failed, successful, or sort of mixed experimental result. And then from there it branches off into a child where it's like the follow-on experiment. Um occasionally I'll kind of rabbit hole down a track like this off-policy MCTS relabeling, do a few experiments, and then realize it's probably not worth it. So then I'll kind of jump to a completely different track. And I call these kind of things like rows.

Right. So so what what I find is that current uh you know closed models that we can access, the public can access today, um they don't seem to be that great at selecting what the next experiment should be in a given track. And they don't seem to be able to kind of step back and do the lateral thinking of like, wait a minute, this track doesn't really make sense. Like let's go back to sort of first principles and and think about, you know.

what the bottleneck might be or like what are we trying to achieve. And so often I had to catch infra bugs myself by prompting the right question to Claude to like investigate, you know, why why what what is causing this discrepancy? And then it'll answer the question.

Um I think with like you know Mythos class models or Mythos plus plus models coming online, um maybe this just completely changes and these these problems just fall to to just improve scaling. Um but at the same time I think there's a lot of like rich opportunity to

develop RL environments that might incentivize this kind of lateral thinking. And so one of the motivations for setting up this Go environment was that I think that Go captures a lot of very interesting research problems, often overlapping with you know LLMs or robotics. And yet it's like very quick to verify. Um the outer loop is ultimately like does the agent do what I think it does? And and you can kind of check the outcome of a Go game quite easily.

And then the inner loop involves all this kind of like you know research engineering around distributed systems, uh predicting whether an idea is gonna work or not, um predicting the you know the difference a particular modification to your training algorithm might make. Um and I think there's a rich

library of sub-tasks and sub-environments that you can kind of train an automated scientist to work on with Go as a sort of outer verification loop that then once you acquire these skills, maybe you can apply them to like other domains like you know um biosciences or robotics.

C

Or automating the A resource.

B

automating.

C

Which is which is the real crux or the um scary slash uh incredible thing uh w for just making AIs making future versions of AIs. And you're suggesting the outer loop here could just be your win rate against Katogo, basically? Uh

B

Um that's one of them. Um I think there's a lot of deeper questions that one could tackle, right? So for example, um let's say you have an idea on um how to improve a scaling law compute multiplier. Um The outcome isn't necessarily like I I uh achieved the best GoBot ever. The outcome might just be like, can I predict what the win rate of my GoBot will be? Yeah, yeah. Uh or can I predict the scaling law plots that emerge from my idea.

But then you can verify that you haven't kind of reward hacked anything by using a very verifiable game like Go on the Outer Loop.

C

I I think there's a couple of interesting follow-on questions. There's questions on the inner loop and the outer loop. On the inner loop, there's a question of how locally verifiable any modification you might make is. That is to say, could would you know whether something is actually improvement or a degradation, some idea you try out? Would you know that if something isn't working as a result of

um a bug or is it the result of the idea itself being wrong? Um Ilya was talking about why having one of the reasons he th he thinks he's a good researcher is He is a good researcher. One of the things he thinks makes him a good researcher is that he has intuition about he has strong belief in what the correct idea is.

And he is able to persevere through bugs and know which things are bugs versus mistakes in the fundamental idea based on his high level belief about this idea should work, so therefore it's this h there has to be bug versus the other way around. Why don't we start with that question actually? Yeah. How locally verifiable are things which are good ideas?

B

As in the case of the success story for deep learning, you can think about this as like a decades long idea that took like took a lot of faith to get it to work. Um And so this presents a very challenging long horizon you know RL problem where you know every every step of the way you have like a committee telling you that this is a bad idea and then ultimately you break it through.

How do you design RL environments that maybe give you some feedback uh uh earlier? Um and and I think this is a very tough open question that I don't have an answer to. But um But you know, ultimately to play a very strong go bot, you probably did need to discover deep learning. Yeah. Right. And so um Um I think that like having a

challenging game that cannot be cheated easily on the outer loop could be used as a sort of outer loop signal for something like discovering the principles of deep learning. Now, of course, like to make it tractable, and this is where research taste really matters, like

you have to come up with ways to initialize your problems so that you don't solve a sort of very intractable problem, right? Like like maybe you can leverage LLMs as as a sort of a universal grammar in the middle to kind of give give you some sort of local feedback.

Um the the the fact that LLMs are universal grammar means that they can kind of move at almost any level of the stack, right? They can think very locally as well as step back and think like in very broad steps. And I think that's where a lot of uh um the the lateral thinking ability of humans kind of come from. Like like how to know if the track that you're pursuing or the objective that you're pursuing is not right and you should be asking a different question.

C

The uh the other question is how stackable local improvements are in the attempt to get to a better result on the outer loop. Um, I've heard rumors that at some AA labs the thing that has gone wrong is that people will individually pursue good ideas. Um, but those don't end up stacking well and so the trading run falls because of some weird

uh interaction between two seemingly good ideas and having a single top-down vision of how things should work is very important. Um having worked at uh different AI labs and also playing around with I guess parallel agents trying different ideas. What is your sense of how parallelizable um AI AI innovation is?

B

Yeah, great question. Um I think the research taste for executing well on you know the bitter lesson is that you need to know how much the bitter lesson can buy you and uh how much is too much to ask for at any given moment. Right? Like of course in the fullness of time.

is the single most important determinant on like how things work. And uh and and uh it's almost like inevitable that as you scale up energy and compute uh and parameters, intelligence will just fall out of that. And that's super super beautiful, super profound. No algorithmic detail really matters beyond that.

But um in present day, we don't have infinite compute and parameters and and inish and arbitrarily good initialization, so we have to come up with like heuristics that kind of give us that. But these heuristics are probably somewhat redundant. So that's probably why you see this effect where like a lot of these compute multipliers don't necessarily stack, is that like they might have some correlated benefits.

Um and then and then you know, three years down the line when the NVIDIA GPUs have gotten even stronger, maybe maybe they stack even less well. Right? Like may maybe like at any given point in time the the the sort of benefit of any given compute multiplier is transitory. Which is what I sort of suspected with the Codigo paper. Like there was many

algorithmic ideas kind of applied. And then you can see that like with you know modern Blackwell GPUs and Ada class GPUs that are much better than the sort of V100 uh grade GPUs that that paper used. Um you can see that like some of these algorithmic tricks to speed up convergence just don't matter so much compared to something else. And I think that's a matter of taste in a pr in in the present time.

C

Interesting. How about the outer loop? How verifiable for making AI smarter? With Go, you do have this outer loop of um win rate against the best open source model out there. And even there, as you were saying, there are other outer loops of did you discover a new phenomenon, which is actually very hard to

If you didn't know scaling laws were important, if you're back in when was Chinchilla or Kaplan scaling laws released, like twenty nineteen. Yeah. So if you're back in twenty fifteen, would you there's not an automated procedure one can easily imagine of uh knowing which paper is the scaling loss paper versus which is just like another random plot. And so e that even in the Go case is uh hard to verify outer loop.

And the whole whole ho the whole idea of an outer loop is to have like some backstop on um on improvement. Uh but let alone for general AGI. Where of course we have a bunch of these benchmarks, but there's a problem that like we know the things we can measure.

And we improve on the things we can measure, but we're we care about this broader ability to do economically useful work, which is um at least until you automate everything, easy and not not super easy to measure. Um So yeah, there's a there's a question up, okay, how how good is the outer verification loop for uh for AI self-improvement and does that matter?

B

Yeah. I'm gonna give a non-rigorous argument, but one that I kind of intuitively believe, which is that, you know, um DeepMind, the AI research lab, um they started as a sort of focus on games, right? Like they they kind of used games as their outer loop. And then the researchers learned uh from experience of solving games. And then like now they're working on LM.

And presumably there was some positive transfer from their time working on games and like Atari and Go and and uh you know StarCraft that like now helps them make good LMs. Like I assume that there's like positive transfer in in some regard, whether it's coding or general research ability or project management, right? Like all these things kind of like probably help them do well. And so if that's the case, why wouldn't it also be true for automated AI resources?

Like they should be able to positively transfer experience tackling quick to verify, uh quick to iterate on environments to something more ambitious and economically useful, like uh you know automating drug discovery or something.

C

I mean I don't know, if i isn't the i hasn't the issue with uh historically until uh y Gemini three or whatever been I a couple of years ago people were saying, look, uh Google hasn't uh isn't catching up in LLMs because they're too tight to the old approach. And yeah, there's gains, but there's also um there's uh ways in it which actively hinders you. Um so it's actually not obvious to me that there's like

B

The jury's still out, right? Like I I think like who knows if the you know l let's say currently Google's doing quite well. Who knows if the uh initialization on training on games is ultimately gonna hobble their ability to be the winner in the long term, right? Like

Like uh it's it's hard to say for sure. Um and uh you know likewise, who knows if the late s seeming late start was really just them kind of pre-training for longer on how how to like scale up TPUs, right? They they invested all their tech tree in like

Uh getting TPUs to be good, which seemed not that useful in the short term, but then in the long term it becomes maybe like a So it's it's even hard for humans to reason about what the optimal research strategy should be, right? Um even with uh the the data we have today.

C

Um okay, we should let people know how they can find out more about this project, whether to fork it themselves, whether to check out your blog post where you do an excellent job explaining many of these ideas. Um where do people go next?

B

Great, yeah. So my my website is uh evjang.com. The there's a blog post that kind of links to a interactive version of this tutorial. Um and on my GitHub, uh which is the username is just Eric Cheng, uh there's uh there's a uh auto go repo that people can fork and reproduce the uh training results.

C

And I also highly recommend people check out this blog post as Rocks May Think, which we touched on some of the ideas in this conversation, but it's this uh grand or m uh uh uh you know um thesis of what happens when you have Uh thinking as a primitive in computer science. Exactly. Right. So I highly recommend people check out that blog list as well.

B

Yeah. And I encourage the you know the audience to th you know think about the relationship between thinking and go, you know, via MCTS and search and how it relates to LMs. I think there's something quite like profound there. Um and Probably underexplored just because Go has been relatively underexplored compared to you know the boom in LMs. Um it's not to say that I think we should have trees in our

in our LMs, but um but but there is some very interesting duality between them and you can actually do a lot of research on Go, um MCTS and and reasoning with you know very small budgets. So that's very exciting.

C

Cool. Awesome, Eric. Thanks for doing this.

B

It's it's an honor to be on the podcast.

This transcript was generated by Metacast using AI and may contain inaccuracies. Learn more about transcripts.
For the best experience, listen in Metacast app for iOS or Android