⚡️How Claude 3.7 Plays Pokémon - podcast episode cover

⚡️How Claude 3.7 Plays Pokémon

Mar 04, 202538 min
--:--
--:--
Listen in podcast apps:

Summary

David Hershey from Anthropic discusses the creation and mechanics behind Claude Plays Pokémon. He explains the project's origin as a tool for experimenting with agents, the architecture, and the challenges Claude faces in navigating the game, including vision and memory limitations. David also touches on the model's learning capabilities, token usage costs, and potential future improvements.

Episode description

Special lightning pod with David Hershey from Anthropic, the person behind Claude Plays Pokémon. Sonnet 3.7 is currently trying to complete Pokémon Red live on Twitch thanks to a special harness that David built so that it can see the screen, navigate through it, remember facts about the game, and more. (Since recording, it has successfully escaped Mt Moon! You can follow along on Twitch: https://www.twitch.tv/claudeplayspokemon)



Get full access to Latent.Space at www.latent.space/subscribe

Transcript

Hey, everyone. Welcome back to another Latent Space Lightning Pod. This is Alessio, partner and CTO at Decibel. There's no switch today. We got a special co-host, Vibu, which if you're a part of the Latent Space community on Discord, you've definitely seen. welcome people as a co-host first time what's up guys and that we had david hershey from entropic today who's the person behind cloud place pokemon

It's funny. I saw we have first DMed about playing magic, the gathering together and, uh, and assess. And then people are like on all of the different nerd angles. You can get me. uh and then people are like david is the person doing this and i was like okay i'll dm him and then um yeah it was cool we already had a touch point so

Welcome to the show. This is our second Anthropic episode. We have Eric Schlantz from the sweet agent before. So welcome. Thank you. Glad to be here. Excited to talk Pokemon. Yeah. So let's give a little background on this.

Sonnet 3.7 came out a couple weeks ago. I don't know. Time goes by this week. Monday. This week? I don't know, man. It feels like two weeks ago. And then you had this Cloudplace Pokemon thing that kind of went viral, where if people remember, there used to be this thing called Twitch.

place pokemon where people could go on twitch and kind of type in the chat and then busy like figure out what the next section that the emulator would take is what you've done instead is given it and it's a cloud and basically have cloud figure out how to walk through it i'm looking at it right now

So far, it's been stuck in Mount Moon for 52 hours. Poor guy. It probably met 15,000 zoo vets. So yeah, let's talk about what gave you the idea for it, kind of the origin story that we can go through the implementation. Totally. Yeah. So I actually started working on it in like June of last year for the first time. And for me, so I work with customers at Anthropic and I just like really wanted to have some way for myself to be able to like.

experiment with agents like in a real way some framework some harness where i could actually just like go to town and try some different things and see see what actually worked to get caught to do like pretty long running tasks in general and so i like had that in one hand and then i was like okay what is the thing that will make me the most addicted to making this work like how will i grind the hardest actually trying this

And Pokemon was a pretty clear answer. Someone else at Anthropic had actually tried once to hook it up, so I had a little bit of the shell of what I needed to actually put it together. and to like kick off what became an obsession a little bit in the coming months. So yeah, like I played with it in June in the switch, trying things out. This was like Sonic 3.5 came out in June of last year, which is when I...

started kicked it around it's very good but like you know you could see like kind of signs of life but like not much really happened um and then ever since then as we released new models it's sort of been like the way that i get to know one of our new models a little bit right so we released the new version of sauna 3.5

in october and like use this to like really kind of see like what's it better at and it got better like you could see it start to like it could get out of the house somewhat reliably which was not always true and it got a starter and it like even named it sometimes like It was like doing stuff. Not great, but like it.

it could move um along the way too like i'm just like we have a quad place pokemon slack channel like i'm sort of just like giving people updates so over time as i'm like posting gifs and up parties updates like i'm this is like slightly growing a popularity of uh cult following internally if people are somewhat interested but then like uh you know a couple weeks ago i was bashing an early version of sonnet 3.7 and it just like

you can just tell it had like it was a little different it's clearly not still good as you said at the top like it's it's in mount moon for its 50 something hour this is a little bit worse than average from what i've seen so far by now but like this is like you know about on brand it doesn't really have a great sense of

direction it's pretty bad at seeing the screen stuff like that but like it plays the game you know like it gets pokemon it catches pokemon like it caught its first pokemon it got out of the radio the first time like a whole bunch of stuff happened for the first time we're like could squint and see a thing play in the game and yeah like posting updates obviously internally it was very fun like people were just like kind of going wild at the fact that this was actually happening finally um

and it was like entertaining enough that i could kind of see it and the other side is like we kind of just like got finally a sense that this was like an actually useful way to measure what was going on with this model you know what i mean Like there's one thing that it's like fun and fun follow along, but like internally, like I think we got more of a sense that like you could actually use this as a bit of a measuring stick for what's going on in the model.

i've spent you don't know how many hours i spent staring at quad play pokemon i've so much i have to have seen and read like millions of words that claude has generated in the course of playing pokemon over the last eight months so

like you can kind of get a feel for like what's actually going better or what's it getting better at and that kind of thing and with this particular release like i think the fact that it got this much better at this kind of reflects a lot of things that we wanted to be true about the model to begin with and

And those sort of lined up where like, okay, maybe this is like an interesting way to actually tell people about what's going on here for a crowd that maybe doesn't like quite know as much about software engineering and all the other ways we've told people about agents in the past. Yeah.

uh were there any other games that you consider to me seems like pokemon is good because it's like you know isometric you know it's kind of like flat so you get any score then it's it doesn't have too many hidden facts about objects, you know, kind of like everything it's described. Did you consider anything else or?

Was Pokemon just kind of like by far and away the first choice? I didn't, but it's mainly because like Pokemon was the first game I ever got as a kid, right? This is like purely coming out of my own nostalgia. um but also like the twitch place pokemon like i i was also something that i cared a lot about well a decade ago or whatever that was um please tell me it's not a decade ago i think it's actually gonna go i'm sorry yeah painfully um and

11 years ago. Yeah. That's insane. February 2014? Yeah. That is nuts. Okay. Pokemon Red is 20 years ago. Oh my god. 20, 25 at least. so yeah i do for me it was that like since then there have been a lot of people and probably like oh we can do this we can do this we can do this i think there's like a lot of fun things you can do pokemon's actually really nice because like if you don't do anything for five seconds like there's typically not a consequence

By the nature of doing inference on a model every snapshot of time, it's actually a pretty good game to be able to do this with. But yeah, it was mostly just my love for Pokemon coming through here. Um, you put together a very nice architecture diagram. Do you want to screen share that? So people on YouTube can follow along and then we'll put in the show notes. If you are just listening, um, I know that Vivo had a bunch of questions on the, uh, on that too.

Yeah, let's do it. Very, very straightforward questions. Basically, can we just double click into all of it? Yeah, yeah, yeah. It's easy. I found it off Twitch and like no one was talking about it. So I started sharing it around and I lost the original source, but basically everything in here is like pure gold.

the memory is a little interesting but yeah if you want to just go through high level yeah you got it yeah i i want to like preface that i do not claim this is like the world's most incredible agent harness in fact

I explicitly have, like, tried not to, like, hyper-engineer this to be, like, the best chance that exists to beat Pokémon. I think it'd be, like, trivial to build a better computer program to beat Pokémon with Quad in the loop. This is, like, meant to be some... combination of like understand what quad's good at and benchmark like and understand quad alongside a simple agent harness so what that boils down to is this is like a pretty straightforward tool using

agent from my perspective is how i would frame it so at the end of the day like the core loop is just like having a conversation that rolls out and it's essentially like you build the prompt including like everything we've had up till now you call the model it sends back some tool use typically you resolve those tools and then talk about summarization but like basically some a few different mechanisms to maintain

the information you need to do something long running inside the context window um so like what this boils down to is like when you think about what an actual prompt looks like it rules out kind of like this you've got tool definitions which describe three tools that i'll get to in a second a short system prompt it's like pretty boring it basically tells the model how to use the tools and like there are about six facts about pokemon that i give it

and like a few corrective things that i've seen it do like really horribly wrong i'm like hey you might want to consider doing this a little bit better but it's like really not a lot of system prompting going on Uh, we have that knowledge base, which referred to you. I'll talk about this is the main way it stores like long-term concepts and memories as it's operating over time. Um, and then the, the bulk of things is this conversation history, which is.

it's like a chain of tool use there's no like user interjections at all for the most part so it's like go and then the model uses the tool and then it gets a result back and then it uses another tool and it gets result back so uh pretty straightforward uh feel free to like cut me off too if you've got questions along the way but otherwise i'm gonna keep rocking yeah yeah go ahead cool uh okay so most of the money of this is just like in the tools themselves when you think about what's going on

it's really like it can press buttons and it can like mess with its knowledge base and that's about it i'll talk about navigator separately because it's like a patch for how it actually can deal with some of its vision deficiencies Using the emulators, they basically execute a sequence of button presses. It'll say, like, press A, B, left, right, whatever. It gets back a screenshot, and a screenshot overlaid with coordinates of the game.

These coordinates are used for this navigator tool that I'll describe in a second, but it's just basically like help quad get a slightly better spatial sense of what's going on on a Game Boy screen. I've been through it a lot. Sorry, does it come with the emulator or are you adding those in?

I add that in. Okay. I have, uh, somewhat extensively reverse engineered Pokemon Red by this point to, like, extract roughly every bit of possible information from it. I don't use most of it, but, like, I have... essentially everything you could know about the current state of the game i have exposed programmatically to be able to tinker with it at this point

I was just reading this diagram, like, yep, you just get what spaces are walkable based on what's stored in RAM. And I'm like, oh, you definitely reverse engineered this little. Yeah. Good news is we also released Claude code this week. If you saw that. And that has been, this would all not be possible without the help of having quad also go figure out how to do all of this for me. Cause I could have done it, but there's a lot of like tedious.

Here are addresses in memory. Map that to a Python program that I had no interest in doing. So thank goodness for quad code. So yeah, it gets these two screenshots. It gets like a small blurb of state, which I read straight from the game.

there's a lot of this here actually like funny enough the thing that matters is location quad will like pretty aggressively hallucinate that it succeeded in transitioning between zones if you don't like tell it it did not uh this just comes down to like literal vision issues and and so like most of the patching of extra help i've given it been like attempts to make it so they could still play despite not being very good at seeing

Game Boy screens in particular. Um, and then it gets like a handful of like reminders. This is this reminders is a decent amount of work, but it's like things like, yeah, remember to use your knowledge base occasionally. Yeah. And. It would tell if it gets stuck, for example. So if you detect that it hasn't moved in 30 time steps. I once saw it see a red box on the screen that was the doormat and think it was a text box.

and spend 12 hours pressing a overnight to try to clear the text box which you see that happen once and you add in some some helpful reminders to not do that How much knowledge does the model have about the game itself? You know? So for example, types, right? Doesn't know about types weaknesses and things like that, or how much are you trying to put into it?

yeah if you go to quad.ai like it it will tell you about like some stuff i have not yet decided if the knowledge that it has about pokemon is helpful or harmful towards it playing the game um like half of the time when it's like oh i know this about pokemon it then like uses that to hallucinate something so for example beginning of the run on twitch you saw like go out of the lab and see like this npc in the bottom of pallet town and be like it's professor oak

i found him and it's like very much not professor oak but like the fact that it has like indexed on this concept is like a little stuff like that that it's like unclear to me where it is but it clearly has some information about it uh there's like a million game guides about pokemon sitting on the internet it's unsurprising that like there's a decent amount of information there i don't really give it a lot of extra information it picks things up i watched on the stream the other day like

it tried to use thundershock on a geodude and it failed it's like i forgot about that that does not work and so like clearly there's like it knows some stuff it's not perfect it picks some stuff up as it goes through the run

Ideally for me, like, I think it's just interesting to see like what it actually learns as it's playing. So the more it does that is the more I'm like actually interested in it. Yeah. The, um, one of our, the score members, not junkie had a good question about the sense of self. Yeah. Like sometimes it gets confused.

Who is the actual playable character in the scene? Like, how do you steer that? Uh, yeah, I think like sometimes it gets confused. Uh, it can be applied to many things in quad playing Pokemon. in particular when it's trying to like look at the screen and understand what's going on so i have like attempted to prompt it all sorts of ways like you are at this exact coordinate and you're in the middle of the screen and you're wearing a red hat and things like that

And like, that's all neat. But Claude doesn't particularly understand like the middle of a Game Boy screen and a whole bunch of concepts like that, which means like you can prompt. all around everywhere, but like this kind of like spatial awareness and where something is with respect to something else is something that Quad is still just like not great at in its current incarnation. So one other side is sometimes lose track of who it is on the screen.

thinks there's something else there i'll keep tracking through this so i hinted at this like other tool that i give it called navigator and this is just like the only other patch that i have for the the vision issue So Navigator basically what it does is like quad can say it wants to go to one of these coordinates, uh, that we provide in the screenshot. And then we like automatically press the buttons to get there.

It has to be something on the screen. I'm not trying to let Claude just navigate a whole map by asking too politely. But one thing you'll notice if you run it without this tool is if Claude wants to get from one side of a wall to another side of the wall.

it like happily just tries to walk through the wall repeatedly because it doesn't quite have the concept of like what's between it and i've spent a lot of time like prompting around this and it just like isn't it's just not it's one of those things not very good at so in order to make it somewhat fun to learn from quad playing pokemon at all

We use this navigator tool, which like helps it actually get around a little bit better. So since we covered a bit about the different tools, the prompting and the strategies, I'm curious how many tokens all this is using, like. There's a part to conversation history and truncating parts of the messages in state. But like, yeah, at a high level, how many tokens is this using? And then can we kind of go into where those are coming from? What's being truncated? Yeah, you got it.

uh when you like think about the the prompts here uh essentially like every step something that looks like this gets sent so if you just go through what each of these looks like everything in the system prompt is probably like a thousand tokens, pretty small, like a handful of paragraphs. Knowledge base, I let get up to like 8,000 tokens. So I put some like arbitrary cap on it so it doesn't go to like quad will write, put a whole bunch of.

bs in there if you just let it keep writing stuff so like the cap helps constrain it to like try to think about what's actually important a little bit and then the conversation history i haven't like kind of finicky but it basically rolls out

um 30 messages that's actually like something you can tune i've tuned it to be 30 messages about like the best performance i've gotten and so what that means is it basically like uses a tool get a response back use tool get a response back it's allowed to do that 30 times and then

At that point, it triggers the summary, which takes that conversation history, summarizes it, makes it the first user message, and then we kind of roll back out again. So the bulk of the tokens end up being in the conversation history once it's its longest.

In fact, the bulk past that ends up being these screenshots, which are scaled up a decent amount to fit in. I do actually like... i allowed to see a number of the previous screenshots but not all of them because you start like it ends up being a ton of context if you'd let it see like even 30 turns worth of screenshots so i'd trim out a few that's where the bulk of the actual tokens are so

in practice this rollout ends up like at max ending up around 100 000 tokens i think is where it is like the the longest message you ever send to the api on one of these turns and it will it will fluctuate

In like summarization, depending on the state of knowledge base, probably between like 5,000 and 100,000 tokens. And is that like per action state of the game? And roughly, do you have like a high level... ballpark estimate of how long this how much and how long it costs to run this like let's say people want to compete and yeah yeah like how much i think you'd really want to think about running this as a side project

in terms of the impact on your personal wallet and how much you care about Pokemon. It's not clear to me that without the blessing of Anthropic, I would have decided to take on this project for my own wallet's sake.

uh especially if you want to like experiment and like try 10 different things i mean it's it's costly i don't know like i haven't spent a lot of time on the exact number it's not that hard to estimate if you like i just told you a bunch of numbers you can kind of back it out uh but like i think to like do a lot of experimentation there's like at least thousands of dollars of tokens being consumed so it's not a it is not a uh a cheap rollout yeah but yeah

In the scheme also of how some people use tokens, it's not terrible. How many turns are you keeping in memory before you summarize? It's 30 right now. Yeah, I've tried more and less. I think like one thing you see a lot when you talk to people building agents is there's like some effective context length that actually like has the model be the smartest.

And that seems to vary slightly model by model, but, but for this model, for whatever purpose, like this 30 message worked better than 20 and better than 40. So, uh, kind of plot in between those that it worked pretty reasonably.

yeah does that change based on location like how many would you want to give it to get it out of mon moon so hey we got it we got to bring plot home we can't let him say yeah for another 57 hours i actually am not sure it does like i i've i've tried posting like i can have a ton of screenshots like 20 or 30 screenshots at a time be able to see and it's like not obvious that like that temporal concept is actually super relevant relevant to it

and again this is just like trust me as someone who has spent like a lot of hours obsessing over this uh you can try to prompt quad a lot of different ways to understand how to navigate better And anything short telling it exactly what to do does not improve its actual navigation. It's just not a skill it's great at. It's good enough to random walk its way through some of the complex mazes. And in good, easy areas, it's pretty good at popping around.

but yeah i think i i can tell you if there was like a way to prompt this slightly different that uh would navigate better and i would believe there is something but it is not like uh it is not an easy lift yeah Yeah, I just asked Cloud AI right now, how do you get through Mumbun in Pokemon Red? It does have a plan, but I don't know if it's the right.

I don't know if it's the right plan. I have seen it come up with a lot of answers to that question. And most of them. Right. This is part of the pain. When I talk about, I'm not sure if it's knowledge is better or worse. Like usually fix it. Like, oh, I know the exit is on the Eastern wall and it just like spent.

12 hours trying that um yeah yeah it's like unclear to me that that we're actually not just like harming it by having it think it knows the answer yeah i think that's the interesting part right like you don't want it to just know the answer like the model clearly knows a lot about the game there's like ev iv maxing pokemon was very very extreme but like if that's what you wanted we could just hook it up to a knowledge base like hook it up to a guide if you know how to beat pokemon red but

Yeah. The interesting piece here is actually like, can it figure out what to do without just memorizing the path? That's exactly right. Like, that's part of why. you know i i don't know part of what i've realized putting this out in the world is people will draw their line of where purity is anywhere on this spectrum like is it is this cheating like yeah maybe um who knows um

Like, frankly, like, I don't particularly care. The main insight that I have is like, when we put this out, like, you learn a lot about what the model is good and bad at by staring at it. And that's kind of what I like about it. Evaluating the model is kind of separate than your emulator and how it can use an emulator, right? Like we can always improve those things. I'm curious, as you switched from 3.5 to 3.7 in sort of reasoning models, were there any degradations there? Like did it...

Did it kind of get worse at anything? And was the prompting somewhat consistent? Like a lot of what we've seen with different reasoning models is like, you kind of prompt them differently, right? You tell them what to do, let them figure it out. But yeah, any insights there?

Yeah. Yeah, that's a good question. One thing that's nice about 3W Tones on it is it's like this hybrid reasoning model, so it kind of can do the old thing and the new thing, and it's actually pretty good at just being an out-of-the-box model. and having this like thinking mode where I can spend time reasoning. So I didn't like really run into any like serious degradations.

the one thing i'll say is like literally every model that has come out with pokemon like the the main change that i have made to this agent is deleting prompt stuff like there's a whole bunch of like band-aid prompt stuff i've added in the past it's like trying to like steer it away from doing a lot of the things that it got horribly stuck doing in the past and

as the models get better i found that just like making sure it's as simple as possible and giving them as much sort of like free reign to try to solve a problem as possible is useful and like the way i think about this is I'm like less confident over time that I understand exactly how a model is intelligent, right? Like it's capable of all of these like ridiculous things. It does PhD level stuff in some ways and like is unable to see a screen as well as a four-year-old in other ways.

But my confidence in exactly what I need to tell it to do to be smart at playing Pokemon is actually really small right now. If I tell it, this is the way you need to solve this problem, that might not actually be the best way for 3.7 to solve this problem. It's like just different than I am in terms of how it thinks about these things. I found that just like kind of like pulling some of the unnecessary instructions where I tried to like use my intuitions about what would make the model better.

out of the prompt over time is the thing that's just like sort of consistently as models got smarter gotten more juice out of this i was watching the stream yesterday or the day before and it was a very tense battle i think they were like down to like 2 hp each and like the opposing pokemon like missed a scratch or something and it didn't die and like you could tell like claw was like wow

it was like very dramatic and i was talking about the game how i yeah is there any thought being put into like trying to have it more like do you prompt it to be more rational to let it know that it's not a real life that it's a game

it's like it feels like it gets very distressed when they're actually the pokémons are actually gonna die it's funny they um it knows it's pokemon it's like you're playing pokemon red like it does know that and it has a sense of that but it clearly wrote some attachment i'll tell you a fun story

We tell it to nickname its Pokemon now. It will occasionally do without it, but it's like more fun if it nicknames its Pokemon. So that's like in the prompt is like, it's fun if you nickname Pokemon, you should consider it. And one thing we found when we started doing that is it got more protective of the Pokemon it nicknamed.

like it's pretty obvious like when it catches a pokemon now that it has a nickname it will like go heal it right away if it's hurt and that did not ever happen before which is pretty like so there's some cute little things cute quirks about quad who really wants to protect its uh precious nicknamed pokemon which is great so i will say it's kind of normal like like when i was five playing pokemon red and you know

I had two HP in the midst of scratch. That meant everything. That was existential. I agree. I agree completely. How about skilled transitioning? So one question that I had. so you're playing pokemon red right so you want to play silver or gold next um have you thought about how models can kind of learn from these games and like store these learnings and then use them again in the future I'm sure it's not part of the project today, but curious your thoughts.

i've thought about it only a little bit which is like i think there's some like interesting when you actually read one of the knowledge bases that it has gained like on some of the longer rollouts when they're good like there's actually some like pretty decent tidbits about how it should act and try and do things and like some of the ways it succeeded in And actually, one of the things that's most unique about 3.7 Sonnet that I've seen is it will have...

like meta commentary on what it's good at and bad at and its knowledge base like i misperceived this thing and so like i need to be careful doing that again you occasionally see show up there which is um which is pretty cool so like i could imagine there being some way to like

translate that knowledge base from one game to another i think my knowledge base is frankly like kind of kludgy of an implementation right now like it's like more or less a python dictionary that's appended to the prompt and I think like you could, you could find better ways if like your goal is to transfer across games and things like that to manage a knowledge base that quad can actually use more or well in different scenarios. Um, but

There's definitely pieces there that I think it would be off on a better foot on the next Pokemon game if it had that. Or even if I were to restart the stream, it would have some tidbits that it would probably speed up if it had access to things that I learned in the past.

It's interesting. Yeah. Yeah. I always think of that in card games, you know, like you have the idea of like temple in a card game and it's like, you know, it's the same magic as it is in, you know, Star Wars, flesh and blood, all these different things. I feel like games is.

similar where like learnings you get from pokemon you can bring over to similar kind of like open world games and i think it's also like particularly interesting for some of the things that are like how quad learns how to play a game in general where it's like a pressing too many buttons at once is a bad idea. Like I lost what's going on. That kind of thing. Like definitely is stuff that it has learned that is like interesting in a meta way, uh, that.

It's like hard to give it that sense of self necessarily in training. I think sometimes like it's hard for it to know like what it's getting bad at in some scenarios, but it's interesting to think about how it can learn across things. Well, like, uh, some of this also is.

you do a simulator right so a lot of what's learning is how do i use a simulator what am i good and bad at but the model internally should know quite a bit about pokemon right like if you've played pokemon going from pokemon red to emerald to diamond

Having played the first one doesn't help you that much in the second, right? You kind of get the general concept. You get what types are good against other types. And the model knows a good bit of this, right? But it's still interesting to show. This is more so like it's...

It shows that knowledge bases kind of help with understanding how to use the emulator, right? Like it struggled and then it figured it out. So, you know, with Pokemon, it's like this thing can now learn how to use them. Yeah, which is pretty cool. That has been like part of what's been fun.

seeing all my progress on this thing. I had a bit of a follow-up question to the last one with Alessio. So if people want to blow thousands of dollars and want to, you know, improve this a little bit, is there anything else that you'd want to see done? Whether that's like... improve emulator try different stuff is just anything that like anyone watching this you you kind of hint them towards what you'd want to work on what they don't work on yeah no doubt

If I had to guess, the biggest lift that exists around this is probably something around the memory, which I don't think is hyper-optimized right now. the nice thing about the memory is like it's always in the prompt like it's it doesn't go away like some sometimes if you leave it up to quad to try to like read and load and save to memory bases like it will under utilize it or forget things but i think there's probably something there

I will say all of the many, many hours I've spent tweaking around the edges of this thing, nothing quite does it like a new model though. Like fundamentally, I think the limitations right now are like some smart things. Like I've seen. And I mean this in the kindest way, but I've seen a lot of people in Twitch tell me about ways that they can fix the navigation capabilities with a better prompt.

People would be welcome to try, but I would guess that would be like a somewhat fruitless avenue. I don't think, I think it's just not very good at understanding. At the first time, I'll give you a very quick anecdote, which I think is like my favorite for like why this is particularly hard.

I have this clip of Quad leaving Oak's lab and being like, great, I left Oak's lab. Now I need to go up to the north end to go to route one. And it just like hits up on the D-pad and goes straight back into the lab. And it's like, shoot.

I'm back in the lab. I need to leave. And it hits down. It's like, great. I'm out of the lab. Now I can go up to that one. It's straight up. It just like goes up and down 12 times. And it's like, you're not, you're not fixing that with a prompt. It just literally doesn't get it.

It doesn't understand. And so it's pretty hard to make like little around the edges changes that like make a huge, huge difference. Yeah. I mean, I've always been fascinated by the fact that Twitch plays Pokemon actually beat the game. Yeah. from a you just look at it and you're like this cannot possibly work because you have people trying to sabotage it too in the chat not everybody's trying to solve it

So I just like that up. It took 16 days and 7 hours for Twitch Plays Pokemon to be red. How close do you think we are to a model that can beat it in less than 16 days? um and do you think it needs like some core like model really big jumps or like do you think it's like we're close i think i think there is model stuff at least from quad like i'm confident there's model stuff that needs to happen for it to be like really capable

I have like four spots in the game stuck in my head. It's like, I think there's literally no hope it's going to get through that. So I think there's like. a gap that's mostly around like its ability to like see and navigate and remember visually like what's going on that i just don't think is like we've figured out yet so to me that's like a pretty big gap

I do expect, I think it's going to keep getting better. I have no reason to believe that this is not just a fundamental ability to scale, learn, and understand problems thing that I think is getting better as we train models to be more capable of these long horizon tasks.

I actually do think this is a pretty reasonable proxy of that, and I think it will continue to get better for a little while. I don't know if there are affordances around images and videos and stuff like that that we need to figure out to make it work. It's unclear to me if that's true or not. But yeah, I think we have a little ways before we can beat the game in 16 days. I do not have a lot of faith that the current stream is going to be standing in Victory Road in 13 days.

what's been your favorite moment from like building this to thinking of the idea to just seeing it play any any like major highlight uh i think like the the hypest i have been is when it beat brock the first time where i was just like you know i've been doing this for eight months and then like a few weeks ago like i kick off a run wake up the next morning and it's like oh my god oh my god and it was the other good thing about it is like i woke up at 8 a.m and i checked my

i i have it send me updates to slack i'm this is like ridiculous things but um it's like literally like about to start the brock battle like i opened my phone it's like oh this is like happening right now and it's like a pretty hype way to start a day i think that was my uh

i highlight i have a lot of like other cute things like some of the cute nicknames it's done over time and things like that are are endearing but but that was like the peak hype for me it was like we beat a gym leader like we've got a badge like quads doing it you know a bit of a follow-up so i noticed that you mentioned it eventually started beating multiple gym leaders were these all the same run was it different ones was it

Yeah, I have like the run that you saw that's like on the graph we put out alongside like in our research blog is like a single run that I have watched like get through at least Serge's gym.

and then it got a little past that and the reason that that's where we stopped reporting is because that's like the physical amount of time that occurred between when i started it and when we launched the model that's like uh that was a very hyper hyper up-to-date graph on on the best run we had so awesome um i know we're running out of time my last question is are we going to work on magic on cloud place magic next

Or maybe we can do like the magic arena intro. Yeah. Uh, funny story. There was a project I did right before I joined anthropic that was like training, uh, an open source model to like slightly be better at picking draft or cards in a draft. like i was training it on like the 17 lands data that exists to like learn how to how to pick cards out of a packs a little bit better uh and i i did talk about that in my interview to get hired at anthropic so

So if I've put time into this, I'm ready. I am ready for that project too, that I have that code sitting around as well somewhere. I really get it in all my nerd ML slash, uh, slash gaming hobbies here.

yeah no i'm ready i don't know if you're planning on open sourcing any of the pokemon stuff but if you want to work in open source on the magic stuff i'll be happy to collaborate awesome we've talked about it i don't i don't know yet what the plan is there's like a certain amount of like this is not my day job that i have to figure out how i want to uh yeah and deal with that uh we'll see yeah um awesome david any parting thoughts

anything people have missed no i think like the one thing i do like to drive home when i i've been talking about this is like i really do think like this is just demonstrating like a thing that is going to make agents better with this model you know like

this is a very fun way to see it but like i think the thing is that it like has some ability to like course correct update and figure things out a little bit better than models have in the past and even if there's like stuff it's dumb at like it tends to have an ability to like power through it

in a new way and so i think what it's exciting to me is just like i think there will be some real world stuff that comes out of this model once people play with it and i'm pretty excited to see like how people take the skills we put on display a little bit here or or lack thereof in some cases and figure out how to turn them into

actual agents that do stuff i have a quick last question on that actually is there any uh guidance or any way that you like quantitatively measure the evals of this system like a lot of it is vibes a lot of it is how far it gets where it gets stuck

but like are there are there any lessons or any specifics about how you measure how it actually does so i've done a lot of like little small tests of like put it in this scenario and see what it does but i like frankly the best test i have is just like run it 10 times on diff on this configuration and like see how quickly it progresses through milestones of the game i mean it's the best thing about games right like it's why the games are such a useful thing there's literal like

benchmarks of gym badges that are moments of progress in a game which are like ways to evaluate what happens and so i think like how quickly it's able to make progress is actually a pretty reason or a reasonable like eval if a slightly expensive one to calculate it's an integration test not a unit test um awesome david thank you for joining thank you for filling in on the whole site too yeah my pleasure thank you guys i appreciate it awesome

This transcript was generated by Metacast using AI and may contain inaccuracies. Learn more about transcripts.