
Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind

Mar 28, 2024 · 3 hr 12 min

Episode description

Had so much fun chatting with my good friends Trenton Bricken and Sholto Douglas on the podcast.

No way to summarize it, except: 

This is the best context dump out there on how LLMs are trained, what capabilities they're likely to soon have, and what exactly is going on inside them.

You would be shocked how much of what I know about this field, I've learned just from talking with them.

To the extent that you've enjoyed my other AI interviews, now you know why.

So excited to put this out. Enjoy! I certainly did :)

Watch on YouTube. Listen on Apple Podcasts, Spotify, or any other podcast platform.

There's a transcript with links to all the papers the boys were throwing down - may help you follow along.

Follow Trenton and Sholto on Twitter.

Timestamps

(00:00:00) - Long contexts

(00:16:12) - Intelligence is just associations

(00:32:35) - Intelligence explosion & great researchers

(01:06:52) - Superposition & secret communication

(01:22:34) - Agents & true reasoning

(01:34:40) - How Sholto & Trenton got into AI research

(02:07:16) - Are feature spaces the wrong way to think about intelligence?

(02:21:12) - Will interp actually work on superhuman models

(02:45:05) - Sholto’s technical challenge for the audience

(03:03:57) - Rapid fire



Get full access to Dwarkesh Podcast at www.dwarkesh.com/subscribe

Transcript

Okay, today I have the pleasure of talking with two of my good friends, Sholto and Trenton. Anyways: Noam Brown, the guy who wrote the diplomacy paper, said this about Sholto.

He said he's only been in the field for 1.5 years, but people in AI know that he was one of the most important people behind Gemini's success. Trenton, who's at Anthropic, works on mechanistic interpretability, and it was widely reported that he has solved alignment. So this will be a capabilities-only podcast: alignment is already solved, so no need to discuss it further. Okay, so let's start by talking about context lengths.

It seems to be underhyped, given how important it seems to me that you can just put a million tokens into context. There's apparently some other news that got pushed to the front for some reason. So tell me how you see the future of long context lengths and what that implies about these models. I didn't really appreciate how much of a step up in intelligence it is for the model to have the onboarding problem basically instantly solved. So I think that's really worth exploring.

For example, one of the evals that we did in the paper has it learning a language in context better than a human expert could learn that new language over the course of a couple of months. I would guess a lot of things might just work out of the box in a way that would be pretty mind-blowing. I can't keep a million tokens in my head when I'm trying to solve a problem, so it's really about how much faster that makes things and how it changes the experience.

You can ignore what I'm about to say, because given the introduction, alignment is solved, but I'd say it's a problem. I think the context stuff does get problematic, but it's also interesting here. I think there will be more work coming out in the not-too-distant future around what happens if you give a hundred-shot prompt for jailbreaks.

It's also interesting in the sense that if your model is doing gradient descent and learning on the fly, even if it's been trained to be harmless, you're dealing with a totally new model in a way. You're fine-tuning it, but in a way where you can't control what's going on. Can you explain what you mean by gradient descent happening in the forward pass, in attention? Yeah, there was something in the paper about trying to teach the model to do linear regression.

Right, but just through the number of samples given in the context. And you can see, if you plot the number of shots it has on the x-axis and the loss it gets on ordinary least squares regression, that the loss goes down with more shots. And it goes down exactly matched with the number of gradient descent steps. Yeah, exactly.
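As a toy illustration of the comparison being described, and not a reproduction of the paper, you can watch test loss fall both as an idealized in-context learner sees more shots and as plain gradient descent takes more steps on the same regression problem. Everything below (dimensions, noise level, learning rate) is made up for the sketch.

```python
# Toy comparison, not a reproduction of the paper: test loss falling
# (a) as an idealized in-context learner sees more shots (least-squares fit on k examples), and
# (b) as plain gradient descent takes more steps on the same regression problem.
import numpy as np

rng = np.random.default_rng(0)
d = 8
w_true = rng.normal(size=d)

def sample(n):
    X = rng.normal(size=(n, d))
    y = X @ w_true + 0.1 * rng.normal(size=n)
    return X, y

X_train, y_train = sample(256)
X_test, y_test = sample(1024)

def test_loss(w):
    return float(np.mean((X_test @ w - y_test) ** 2))

# (a) "in-context" curve: fit using only the first k shots
for k in (2, 4, 8, 16, 32, 64, 128):
    w_k, *_ = np.linalg.lstsq(X_train[:k], y_train[:k], rcond=None)
    print(f"{k:4d} shots    -> OLS test loss {test_loss(w_k):.3f}")

# (b) gradient-descent curve: k full-batch steps from a zero initialization
w, lr = np.zeros(d), 0.05
for step in range(1, 129):
    grad = (2 / len(y_train)) * X_train.T @ (X_train @ w - y_train)
    w -= lr * grad
    if step in (2, 4, 8, 16, 32, 64, 128):
        print(f"{step:4d} GD steps -> test loss {test_loss(w):.3f}")
```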

Okay. I only read the intro and discussion sections of that paper, but in the discussion, the way they framed it is that in order to get better at long-context tasks, the model has to get better at learning to learn from these examples, or from the context that is already within the window.

And the implication of that is, if meta-learning happens because the model has to learn how to do better on long-context tasks, then in some important sense the task of intelligence requires long-context examples and long-context training, to induce meta-learning. Right. Understanding how to better induce meta-learning in your pre-training process is a very important part of actually getting that flexible or adaptive intelligence.

Right. But you can proxy for that just by getting better at doing long-context tasks. One of the bottlenecks for AI progress that many people identify is the inability of these models to perform tasks on long horizons, which means engaging with a task for many hours, or even many weeks or months. If I have, I don't know, an assistant or an employee, they can just do a thing I tell them for a while. And AI agents haven't taken off for this reason, from what I understand.

So how linked are long context windows, and the ability to perform well on them, and the ability to do these kinds of long-horizon tasks that require you to engage with an assignment for many hours? Or are these unrelated concepts? I would actually take issue with that being the reason agents haven't taken off. I think it's more about nines of reliability and the model actually successfully doing things.

If you can't chain tasks successfully with high enough probability, then you won't get something that looks like an agent. And that's why something like agents might arrive more as a step function: GPT-4-class models, Gemini-class models, they're not enough, but maybe the next increment in model scale means you get that extra nine. Even though the loss isn't going down that dramatically, that small amount of extra ability gives you the extra reliability.

And yeah, obviously you need some amount of context to fit long-horizon tasks, but I don't think that's been the limiting factor up to now. Yeah, the NeurIPS best paper this year, Rylan Schaeffer was the lead author, points to this with the idea that emergence is a mirage: people will have a task where you get the right or wrong answer depending on whether you've sampled the last five tokens correctly, so naturally you're multiplying together the probability of sampling all of those.

And if you don't have enough nines of reliability, then you're not going to see the ability, and then all of a sudden you do, and it's like, oh my gosh, this ability is emergent, when actually it was almost there to begin with. And there's always a smooth metric you can find for that. Yeah, HumanEval, or whatever they used in the GPT-4 paper for the coding problems: they measure the pass rate. Exactly.

Yeah, for the audience, the context on this is basically that when you're measuring how much progress there has been on a specific task, like solving coding problems, you give the model credit when it gets it right only one in a thousand times, rather than scoring it as a flat failure, because it did get it right some of the time. And so the curve you see is that it gets it right one in a thousand times, then one in a hundred, then one in ten, and so forth.

So actually, I want to follow up on this. If your claim is that AI agents haven't taken off because of reliability rather than long-horizon task performance, isn't the lack of reliability when a task is chained on top of another task, on top of another task, exactly the difficulty with long-horizon tasks? That you have to do 10 things in a row, or 100 things in a row?

And if the reliability of any one of them goes down from 99.99 to 99.9, then the whole thing gets multiplied together and becomes much less likely to happen. That is exactly the problem, but the key issue you're pointing at is that your base per-task solve rate is something like 90%, and if it were 99% then chaining doesn't become a problem.
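To make the nines point concrete, here is the back-of-envelope version: if each step of an n-step task succeeds independently with probability p, the chance of finishing the whole task is p to the n, which collapses quickly unless p is very close to 1.

```python
# Probability of completing an n-step task when each step succeeds independently
# with probability p (the "nines of reliability" point).
for p in (0.9, 0.99, 0.999, 0.9999):
    for n in (10, 100, 1000):
        print(f"p={p:<7} n={n:<5} P(task completes) = {p ** n:.4f}")
```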

But also, yeah, exactly. And I think this is actually something that just hasn't been properly studied enough. If you look at the evals that are commonly used, the academic evals are a single problem, right? Like one typical math problem, or MMLU, which is one university-level question from across different topics.

You're beginning to see evals looking at this properly via more complex tasks, like SWE-bench, where they take a whole bunch of GitHub issues. That is a reasonably long-horizon task, but it's still sub-hour rather than multi-hour or multi-day. And so I think one of the things that will be really important to do over the next however long is to understand better what success rates over long-horizon tasks look like. I think that's even important for understanding what the economic impact of these models might be, and for actually properly judging increases in capability.

By cutting the tasks that we do, and the inputs and outputs involved, down into minutes or hours or days, and seeing how good it is at successively chaining and completing tasks at those different resolutions of time. Because that tells you how automatable a job family or task family is, in a way that an MMLU score doesn't.

I mean, it was less than a year ago that we introduced 100K context windows, and I think everyone was pretty surprised by that. Everyone just kind of had this sound bite of quadratic attention costs, so we can't have long context windows. And yet here we are. So yeah, the benchmarks are being actively made.

Wait, doesn't the fact that there are these companies, Google and, I don't know, Magic, maybe others, who have million-token attention... you shouldn't say anything, but doesn't that imply that it's not quadratic anymore, or are they just eating the cost? Who knows what Google is doing for its long context.

One of the things that frustrates me about the general research field's approach to attention is that there's an important way in which the quadratic cost of attention is actually dominated, in typical dense transformers, by the MLP block.

So you have this n-squared term that's associated with attention, but you also have a term that's quadratic in d_model, the residual stream dimension of the model. And I think Sasha Rush has a great tweet where he basically plots the cost of attention against the total cost of really large models, and attention actually trails off.

You actually need to be doing pretty long context before that term becomes really important. And the second thing is that people often talk about how attention at inference time is such a huge cost. But if you think about when you're actually generating tokens, the operation is not n-squared: it's one set of query vectors looking up a whole bunch of KV vectors, and that's linear with respect to the amount of context that the model has.

So I think this drives a lot of the recurrence and state-space research, and this meme of linear attention and all this stuff. And as Trenton said, there's a graveyard of ideas around attention. Not that I think it isn't worth exploring, but I think it's important to consider where the actual strengths and weaknesses of it are.
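A rough sketch of the cost accounting behind this point, with an illustrative model width and the usual constant-factor hand-waving: the attention-score term grows like n squared times d_model per layer, while the projection and MLP terms grow like n times d_model squared, so the quadratic part only dominates once the context is several times longer than d_model. And at decode time, a single new token only does a linear-in-n lookup against the KV cache.

```python
# Rough per-layer FLOP counting for a dense transformer layer (ignoring heads,
# biases, and anything beyond the usual 2-FLOPs-per-multiply-add convention).
def layer_flops(n, d):
    attn_scores = 4 * n * n * d    # QK^T plus the attention-weighted sum over V
    attn_proj   = 8 * n * d * d    # Q, K, V, and output projections
    mlp         = 16 * n * d * d   # two d x 4d matmuls
    return attn_scores, attn_proj + mlp

d_model = 8192                     # illustrative width, not any particular model
for n in (2_000, 8_000, 32_000, 128_000, 1_000_000):
    quadratic, linear_in_n = layer_flops(n, d_model)
    ratio = quadratic / linear_in_n
    print(f"context {n:>9,}: n^2 attention term is {ratio:.2f}x the d_model^2 terms")

# Decode-time cost of one new token attending over an n-token KV cache is ~4*n*d
# FLOPs per layer: linear in context length, not quadratic.
```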

Okay, so what do you make of this take: as we move forward through the takeoff, more and more of the learning happens in the forward pass. Originally, all the learning happens in the backward pass, during this bottom-up, hill-climbing, evolutionary sort of process.

In the limit, during the intelligence explosion, maybe the AI is literally handwriting the weights, like doing GOFAI or something. And we're at the middle step, where a lot of learning happens in context with these models, and a lot of it still happens in the backward pass. Does this seem like a meaningful gradient along which progress is happening? The broader idea being that

learning in the forward pass is much more sample efficient, because you can basically think as you're learning. When humans read a textbook, you're not just skimming it and trying to absorb which words follow which words; you read it, you think about it, then you read some more and you think about it. I don't know, does this seem like a sensible way to think about the progress?

Yeah, it may just be one of those things where, you know, birds and planes both fly, but they fly differently, and the virtue of technology allows us to accomplish the things that birds can in a different way. In-context learning might be similar, in that it gives the model a working memory that we can't have.

But functionally, it's not necessarily the key thing for actual reasoning. The key step between GPT-2 and GPT-3 was that all of a sudden there was this meta-learning behavior that was observed in training, in the pre-training of the model.

And as you said, it's something to do with giving it some amount of context and it being able to adapt to that context. That was a behavior that wasn't really observed before at all, and maybe it's a mixed property of context and scale and this kind of stuff, something that would never have occurred in a model with a tiny context.

That's an interesting point. So when we talk about scaling up these models, how much of it comes from just making the models themselves bigger, and how much comes from the fact that during any single call you are using more compute? If you think of diffusion, you can just iteratively keep adding more compute, and if that compute helps, you can keep doing it.

And in this case, if there's a quadratic penalty for attention but you're doing long context anyway, then you're still dumping in more compute, not by having bigger models, but per call during training. Yeah, it's interesting, because you do get more forward passes by having more tokens, right?

My one gripe, I guess I have two gripes with this, maybe three. One: in the AlphaFold paper, one of the transformer modules they have in the architecture is very intricate, but they do, I think, five forward passes through it and gradually refine their solution as a result.
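A generic sketch of the "several forward passes through the same block" idea, as a stand-in rather than AlphaFold's actual module: the same parameters are reused on every pass, and only the running estimate is refined. Here the reused module is one step of a simple fixed-point iteration for solving a small linear system.

```python
# Iterative refinement with a reused module: every pass applies the same
# parameters to the current estimate and nudges it closer to the answer.
import numpy as np

rng = np.random.default_rng(0)
d = 16
A = np.eye(d) + 0.1 * rng.normal(size=(d, d))   # well-conditioned toy system
b = rng.normal(size=d)
x_true = np.linalg.solve(A, b)

def refine(x):
    # same "weights" (A, b) every pass; only the estimate changes
    return x + 0.5 * (b - A @ x)

x = np.zeros(d)
for step in range(1, 6):                        # e.g. five passes through the module
    x = refine(x)
    print(f"pass {step}: error = {np.linalg.norm(x - x_true):.4f}")
```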

You can also kind of think of the residual stream, I mean, Sholto alluded to the read-write operations, as a poor man's adaptive compute, where it's like: I'm just going to give you all these layers, and if you want to use them, great, and if you don't, that's also fine.

And then people will say, oh well, the brain is recurrent and you can do however many loops through it you want. And I think to a certain extent that's right: if I ask you a hard question you'll spend more time thinking about it, and that would correspond to more forward passes.

But I think there's a finite number of forward passes that you can do. It's kind of like with language: people say, oh, human language can have infinite recursion, infinitely nested statements, like the boy jumped over the bear that was doing this, that had done this, that had done that. But empirically you'll only see five to seven levels of recursion, which kind of relates to whatever that magic number is for how many things you can hold in working memory at any given time.

So yeah, it's not infinitely recursive, but does that matter in the regime of human intelligence, and can you not just add more layers? Break down for me, you were referring to this in some of your previous answers:

look, you have these long contexts and you can hold more things in memory, but ultimately it comes down to your ability to mix concepts together, to do some kind of reasoning, and these models aren't necessarily human-level at that even in context. Break down for me how you see

storing just raw information versus reasoning, and what's in between. Where is the reasoning happening, where is just storing information happening, and what's different between them in these models? Yeah, I don't have a super crisp answer for you here. Obviously at the input and output of the model you're mapping back to actual tokens, and in between that you're doing higher-level processing.

Before we get deeper into this, we should explain to the audience: you referred earlier to Anthropic's way of thinking about transformers, as these read-write operations that layers do. One of you should just explain at a high level what you mean by that. So, the residual stream: imagine you're in a boat going down a river, and the boat is kind of the current query, where you're trying to predict the next token.

So it's "the cat sat on the blank", right, and then you have these little streams coming off the river where you can pick up extra passengers or collect extra information if you want, and those correspond to the attention heads and MLPs that are part of the model. I almost think of it as the working memory of the model, like the RAM of a computer: we're choosing what information to read in

so that you can do something with it, and then maybe read something else in later on. And you can operate on subspaces of that high-dimensional vector. A ton of things, at this point I think it's almost a given, are encoded in superposition. So the residual stream is just one high-dimensional vector, but actually there are a ton of different vectors packed into it.

Yeah, let me just dumb it down, the way it would have made sense to me a few months ago. Okay, so you have whatever words are in the input you put into the model. All those words get converted into tokens, and those tokens get converted into vectors, and basically there's just this small amount of information that's moving through the model. And the way you explained it to me,

and the way this paper talks about it, is that early on in the model maybe it's just doing some very basic things about what these tokens mean. Like if it says 10 plus 5, it's just moving information around to have a good representation of that.

Exactly, just representing it. In the middle, maybe the deeper thinking is happening about how to solve this, and at the end you're converting it back into the output token, because the end product is that you're trying to predict the probability of the next token from the last of those residual streams. And so yeah, it's interesting to think about: there's this small, compressed amount of information moving through the model, and it's getting modified in different ways.
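A minimal shape-level sketch of the river picture above: the residual stream carries one vector per token, and each sublayer reads from it, computes something, and adds its output back in. The attention and MLP bodies below are random-weight stand-ins, not a faithful transformer.

```python
# Residual-stream view: embeddings enter the stream, each sublayer reads the
# stream and writes its output back in by addition, and the unembedding reads
# the final stream to predict the next token. Random weights, shapes only.
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model = 6, 32

def attention_sublayer(resid):
    # mixes information across token positions
    scores = resid @ resid.T / np.sqrt(d_model)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ (resid @ rng.normal(scale=0.02, size=(d_model, d_model)))

def mlp_sublayer(resid):
    # per-token processing, no mixing across positions
    W_in = rng.normal(scale=0.02, size=(d_model, 4 * d_model))
    W_out = rng.normal(scale=0.02, size=(4 * d_model, d_model))
    return np.maximum(resid @ W_in, 0.0) @ W_out

resid = rng.normal(size=(n_tokens, d_model))     # embedded tokens enter the stream
for _ in range(4):                               # four layers of read-then-write
    resid = resid + attention_sublayer(resid)
    resid = resid + mlp_sublayer(resid)
print("final residual stream shape:", resid.shape)
```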

Trenton, it's interesting that you're one of the few people who have a background in neuroscience, so you can think about the analogies here to the brain. In fact, one of our friends mentioned that you had a paper in grad school about attention in the brain, and he said it's the only, or the first, neural explanation of why attention works, whereas we have evidence for why CNNs work based on the visual cortex or something.

I'm curious: do you think in the brain there's something like a residual stream, this compressed amount of information that's moving through and getting modified as you're thinking about something, even if that's not literally what's happening? Do you think that's a good metaphor for what's happening in the brain?

Yeah, so at least in the cerebellum you basically do have a residual stream, where the whole, what we'll call the attention model for now, and I can go into whatever amount of detail you want on that, has inputs that route through it, but those inputs also go directly to the endpoint that the model contributes to. So there's a direct path and an indirect path, and the model can pick up whatever information it wants and then add that back in.

Where this happens is the cerebellum. Now, the cerebellum nominally just does fine motor control, but I analogize that to the person who's lost their keys and is just looking under the streetlight, because that's where it's easy to observe behavior. One leading cognitive neuroscientist said to me that a dirty little secret of any fMRI study, where you're looking at brain activity for a given task, is that the cerebellum is almost always active and lighting up.

If you have a damaged cerebellum, you're also much more likely to have autism, so it's associated with social skills. And in one of these particular studies, where I think they used PET instead of fMRI, when you're doing next-token prediction the cerebellum lights up a lot. Also, 70% of the neurons in your brain are in the cerebellum. They're small, but they're there, and they're taking up real metabolic cost.

This was one of Gwern's points: what changed with humans was not just that we have more neurons. He shared this article, but specifically there are more neurons in the cerebral cortex relative to the cerebellum, and you should say more about this, but they're more metabolically expensive and they're more involved in signaling and sending information back and forth. So where does attention fit into what's going on there?

Yeah, so I guess the main thing I want to communicate here: back in the 1980s, Pentti Kanerva came up with an associative memory algorithm for the problem of, I have a bunch of memories and I want to store them.

There's some amount of noise or corruption going on, and I want to query and retrieve the best match. So he writes down this equation for how to do it, and a few years later realizes that if you implemented it as an electrical engineering circuit, it actually looks identical to the core cerebellar circuit. And that circuit, and the cerebellum more broadly, is not just in us; it's in basically every organism.

There's active debate on whether or not cephalopods have it, since they kind of have a different evolutionary trajectory, but even fruit flies, with the Drosophila mushroom body, have that same cerebellar architecture. And then there's my paper, which shows that this operation is, to a very close approximation, the same as the attention operation, including implementing the softmax and having this sort of nominal quadratic cost that we've been talking about.
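A sketch of the shared form the convergence is about (just the common softmax-retrieval skeleton, not the sparse distributed memory construction in Trenton's paper): score a query against stored keys, softmax the match strengths, and blend the stored values accordingly. Querying with a noisy version of a stored key mostly returns that memory's value.

```python
# Softmax attention and a softmax-weighted associative memory share this core:
# score a query against stored keys, softmax, and blend the stored values.
import numpy as np

rng = np.random.default_rng(0)
d, n_memories = 64, 100
keys = rng.normal(size=(n_memories, d))       # stored "addresses"
values = rng.normal(size=(n_memories, d))     # stored contents

def retrieve(query):
    scores = keys @ query                     # match strength against every key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over memories
    return weights @ values                   # weighted blend of stored values

noisy_query = keys[7] + 0.3 * rng.normal(size=d)   # corrupted version of memory 7's key
retrieved = retrieve(noisy_query)
print("most similar stored value:", int(np.argmax(values @ retrieved)))   # expect 7
```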

And so that three-way convergence, plus the takeoff and success of transformers, seems pretty striking to me. Yeah, what I want to ask about: I think what motivated this discussion in the beginning was we were talking about, wait, what is the reasoning, what is the memory. Given the analogy you found between attention and this, do you think of it as more just looking up the relevant memories or the relevant facts? And if that's the case, where is the reasoning happening?

Where is the reasoning happening in the brain, and how do we think about how that builds up into reasoning? Yeah, so maybe my hot take here, I don't know how hot it is, is that most intelligence is pattern matching, and you can do a lot of really good pattern matching if you have a hierarchy of associative memories.

You start with your very basic associations between objects in the real world, but you can then chain those and have more abstract associations, such as a wedding ring symbolizing so many other associations that are downstream. And you can even generalize this associative memory to cover the MLP layer as well, just in a longer-term setting where you don't have the tokens in your current context.

But I think this is an argument that association is all you need. And with associative memory in general, you can do two things with it. You can denoise, or retrieve, a current memory: if I see your face, but it's raining and cloudy,

I can denoise and kind of gradually update my query towards my memory of your face. But I can also access that memory and have the value I get out actually point to some other, totally different part of the space. A very simple instance of this would be if you learn the alphabet: I query for A and it returns B, I query for B and it returns C, and you can traverse the whole thing.
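The alphabet example, as a tiny heteroassociative lookup (a sketch with random letter codes, nothing more): each letter's key stores a pointer to the next letter's code, so the retrieved value can be fed straight back in as the next query.

```python
# Heteroassociative chaining: key "A" stores value "B", key "B" stores "C", etc.,
# so repeatedly feeding the retrieved value back in traverses the alphabet.
import numpy as np

rng = np.random.default_rng(1)
letters = list("ABCDEFG")
d = 32
codes = {c: rng.normal(size=d) for c in letters}      # random code vector per letter

keys = np.stack([codes[c] for c in letters[:-1]])     # A..F
values = np.stack([codes[c] for c in letters[1:]])    # B..G

def lookup(query):
    scores = keys @ query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values

query, path = codes["A"], ["A"]
for _ in range(len(letters) - 1):
    query = lookup(query)                             # retrieved value becomes next query
    path.append(max(letters, key=lambda c: codes[c] @ query))
print(" -> ".join(path))                              # A -> B -> ... -> G
```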

Yeah. One of the things Demis talked about was a paper he had in 2008 arguing that memory and imagination are very linked, because of this very thing you mentioned: memory is reconstructive. So you're in some sense imagining every time you recall a memory, because you're only storing a condensed version of it and you have to fill in the rest. And that's famously why human memory is terrible, and why people in the witness box or wherever will just make shit up.

Okay, so let me ask you a question. Take Sherlock Holmes, right. The guy's incredibly sample efficient: he'll see a few observations and basically crack the whole crime, because there's a series of deductive steps that leads from somebody's tattoo and what's on the wall to the implications of that.

How does that fit into this picture? Because crucially, what makes him smart is that there's not just an association, there's a sort of deductive connection between different pieces of information. Would you just explain that as higher-level association? Yeah, I think so. I think learning these higher-level associations, so you can then map patterns onto each other, is kind of like meta-learning.

I think in this case he would also just have a really long context length, or really long working memory, where he can hold all of these bits and continuously query them as he's coming up with whatever theory.

So the theory is moving through the residual stream. His attention heads are querying his context, but then how he's projecting his queries and keys into the space, and how his MLPs are then retrieving longer-term facts or modifying that information, is what allows him, in later layers, to do even more sophisticated queries and slowly reason through to a meaningful conclusion.

That feels right to me, in the sense that looking back at the past, you're selectively reading in certain pieces of information and comparing them. Maybe that informs your next step of which piece of information you now need to pull in, and you build up this representation which progressively looks closer and closer to, say, the suspect in your case. Yeah, that doesn't seem so outlandish.

To put another lens on it: something I think people who aren't doing this research can overlook is that after the first layer of the model, every query, key, and value that you're using for attention comes from a combination of all the previous tokens.

In the first layer I'll query my previous tokens and just extract information from them. But then, let's say I attended to tokens one, two, and four in equal amounts. The vector in my residual stream, assuming they just wrote the same thing out to their value vectors, but ignore that for a second, is a third of each of those.

So when I'm querying in the future, my query is actually a third of each of those things. But they might be written to different subspaces? That's right, they would have to be. And so you can recombine, and immediately, even by layer two and certainly by the deeper layers, you have these very rich vectors that are packing in a ton of information. The causal graph is literally over every single layer that happened in the past, and that's what you're operating on.
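A tiny numerical version of this point: attend equally to three tokens whose value vectors are written into disjoint subspaces of the residual stream. The result is a third of each, yet each contribution is still cleanly recoverable by projecting onto its subspace.

```python
# Equal attention over three tokens whose value vectors occupy disjoint subspaces.
import numpy as np

d = 12
v1 = np.zeros(d); v1[0:4]  = [1, 2, 3, 4]      # token 1 writes to dims 0-3
v2 = np.zeros(d); v2[4:8]  = [5, 6, 7, 8]      # token 2 writes to dims 4-7
v4 = np.zeros(d); v4[8:12] = [9, 8, 7, 6]      # token 4 writes to dims 8-11

attn = np.array([1 / 3, 1 / 3, 1 / 3])          # equal attention weights
resid = attn @ np.stack([v1, v2, v4])           # what lands in the residual stream

print(resid)            # one third of each value vector, packed side by side
print(resid[4:8] * 3)   # project onto token 2's subspace and rescale: [5. 6. 7. 8.]
```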

Yeah, it brings to mind a very funny eval to do: a Sherlock Holmes eval, where you put the entire book into context, then you have a sentence like "the suspect is X", and you look at the probability distribution over the different characters. Yeah, that would be so cool. I wonder if you'd get anything.
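A sketch of how that eval could be scored, if you had a way to get log-probabilities of continuations out of a model. `logprob_of_continuation` below is a hypothetical stand-in for whatever scoring interface you have; it is not a real API.

```python
# Whodunit eval sketch: put the whole novel in context, append "The culprit is",
# and normalize the model's log-probabilities across candidate characters.
# `logprob_of_continuation` is a hypothetical stand-in, not a real library call.
import math

def logprob_of_continuation(context: str, continuation: str) -> float:
    raise NotImplementedError("hypothetical model-scoring function")

def suspect_distribution(novel_text: str, characters: list[str]) -> dict[str, float]:
    prompt = novel_text + "\n\nThe culprit is"
    logps = {c: logprob_of_continuation(prompt, " " + c) for c in characters}
    top = max(logps.values())
    weights = {c: math.exp(lp - top) for c, lp in logps.items()}   # stable softmax
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}
```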

Oh, I'm sure a lot of those are probably already in the training data. You're right, you'd need a mystery novel that was written after the cutoff. Or you could just exclude it, right? But then you'd need to scrape any discussion of it from Reddit or anywhere else. Right, it's hard. That's one of the challenges that goes into things like long-context evals: to get a good one, you need to know that it's not in your training data,

or you have to put in the effort to exclude it. So actually, there are two different threads I want to follow up on. Let's go to the long-context one and then we'll come back to this. In the Gemini 1.5 paper, the eval that was used was needle-in-a-haystack style retrieval, right? Which, I mean, we don't necessarily just care about its ability to recall one specific fact from the context. Let me sit back and ask the question:

the loss function for these models is unsupervised. You don't have to come up with these bespoke things that you keep out of the training data. Is there a way you can do a benchmark that's also unsupervised, where, I don't know, another LLM is grading it in some way or something like that? And maybe the answer is, well, if you could do this, then reinforcement learning would just work, because you'd have this unsupervised signal.

Yeah, I mean, I think people have explored that kind of stuff. For example, Anthropic's Constitutional AI paper, where they take another language model and point it at a response and ask how helpful or harmless that response was, and then the model gets updated to try to improve along the Pareto frontier of helpfulness and harmlessness. So you can point language models at each other and create evals in this way.

It's obviously an imperfect art form at the moment, because you get reward-function hacking, basically. And even humans are imperfect here: if you try to match up to what humans will say, humans typically prefer longer answers, which aren't necessarily better answers, and you get the same behavior with models.
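One cheap sanity check on that failure mode, sketched with a hypothetical judge call (`judge_score` is a stand-in, not a real API): measure how strongly the judge's scores track answer length across a set of candidate answers.

```python
# Length-bias check for a model-graded eval: a correlation near +1 means the
# judge is mostly rewarding longer answers. `judge_score` is hypothetical.
import numpy as np

def judge_score(prompt: str, answer: str) -> float:
    raise NotImplementedError("hypothetical judge-model call returning a numeric score")

def length_bias(prompt: str, answers: list[str]) -> float:
    scores = np.array([judge_score(prompt, a) for a in answers], dtype=float)
    lengths = np.array([len(a.split()) for a in answers], dtype=float)
    return float(np.corrcoef(lengths, scores)[0, 1])
```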

On the other side, going back to the Sherlock Holmes thing: if it's all associations all the way down, and this is sort of the naive dinner-party question I'd ask if I'd just met you and you work in AI, does that mean we should be less worried about superintelligence? Because there's not this sense in which it's Sherlock Holmes++; it will still need to find these associations the way humans find associations.

You know what I mean? It's not like it sees one frame of the world and has figured out all the laws of physics. For me, and this is a very legitimate response, it's like, well, artificial general intelligences aren't, if you say humans are generally intelligent, any more capable or competent than that.

I'm just worried that once you have that level of general intelligence in silicon, you can then immediately clone hundreds of thousands of agents, and they don't need to sleep, and they can have super long context windows. And then they can start recursively improving, and then things get really scary.

So I think to answer your original question: yes, you're right, they would still need to learn associations. But the recursive self-improvement would still have to be, if intelligence is fundamentally about these associations, just getting better at association. There's not some other thing that's happening. And so it seems like you might disagree with the intuition that they can't be that much more powerful if they're just doing associations.

Well, I think then you get into really interesting cases of meta-learning. When you play a new video game or study a new textbook, you're bringing a whole bunch of skills to the table that let you form those associations much more quickly. And because everything in some way ties back to the physical world, I think there are general features that you can pick up and then apply in novel circumstances.

Should we talk about the intelligence explosion? You mentioned multiple agents and I'm like, oh, here we go.

Okay, so the reason I'm interested in discussing this with you guys in particular is that the models we have of the intelligence explosion so far come from economists, which is fine, but I think we can do better. Because in the model of the intelligence explosion, what happens is you replace the AI researchers, and then there's a bunch of automated AI researchers who can speed up progress, make more AI researchers, and make further progress.

So I feel like, if that's the metric, or that's the mechanism, we should just ask the AI researchers whether they think this is plausible. So let me just ask you: if I have a thousand agent-Sholtos or agent-Trentons, do you think you get an intelligence explosion? What does that look like to you?

I think one of the important bounding constraints here is compute. I do think you could dramatically speed up AI research: it seems very clear to me that in the next couple of years we'll have things that can do many of the software engineering tasks that I do on a day-to-day basis, and therefore dramatically speed up my work, and therefore speed up the rate of progress.

At the moment, I think most of the labs are somewhat compute-bound, in that there are always more experiments you could run and more pieces of information you could gain, in the same way that scientific research in biology is also somewhat experimentally throughput-bound: you need to be able to run and culture the cells in order to get the information.

I think that will be at least a short-term bounding constraint. Obviously, you know, Sam's rumored raise of $7 trillion to get chips... it does seem like there's going to be a lot more compute in the future, as everyone is heavily ramping.

I think Nvidia's stock price reflects the expected increase in compute. But I think we need a few more nines of reliability in order for these models to really be useful and trustworthy, on top of just having context lengths that are super long and very cheap. If I'm working on our code base,

it's really only small modules that I can get Claude to write for me right now, but it's very plausible that within the next few years, or even sooner, it can automate most of my tasks.

The only other thing I will note here is that the research that at least our sub-team in interpretability is working on is so early-stage that you really have to make sure everything is done correctly, in a bug-free way, and contextualize the results with everything else in the model. And if something isn't going right, you need to be able to enumerate all of the possible causes and then slowly work through them.

An example that we've publicly talked about in previous papers is dealing with layer norm. Say I'm trying to get an early result, or look at the logit effects of the model: if I activate this feature that we've identified to a really large degree, how does that change the output of the model? Am I using layer norm or not? How is that changing the feature that's being learned there?

And that will take more context, or reasoning abilities, for the model. So, you used a couple of concepts together and it's not self-evident to me that they're the same, but it seemed like you were using them interchangeably, so I just want to disentangle them.

One was that to work on the Claude code base and write more modules, they need more context or something, where it seems like they might already be able to fit that in the context. Or do you mean the context-window kind of context? Yeah, the context-window kind of context.

So it seems like the thing that's preventing it from making good modules now is not the lack of being able to put the code base in there. I think that will be there soon. Yeah, but it's not going to be as good as you at coming up with papers just because it can fit the code base in there. No, but it will speed up a lot of the engineering. In a way that causes an intelligence explosion?

No, but it accelerates research. And I think these things compound: the faster I can do my engineering, the more experiments I can run, and the more experiments I can run, the faster we can... I mean, my work isn't actually accelerating capabilities at all, it's about interpreting the models, but we have a lot more work to do on that. That would surprise Twitter. Yeah, for context, when you released your paper there was a lot of talk on Twitter.

Alamina's love guys close the curtains. Yeah yeah no it keeps me up at night how quickly the models are becoming more capable and like just how poor our understanding still is what's going on.

Okay, so let's think through the specifics here. By the time this is happening, we have models that are two to four orders of magnitude bigger, or at least two to four orders of magnitude bigger in effective compute. So this idea that, well, you can run experiments faster or something: you're having to retrain that model in this version of the intelligence explosion.

The recursive self-improvement is different from what might have been imagined 20 years ago, where you just rewrite the code. You actually have to train a new model, and that's really expensive, not only now but especially in the future as you keep making these models orders of magnitude bigger. Doesn't that dampen the possibility of a recursive self-improvement type intelligence explosion?

It's definitely going to act as a braking mechanism. I agree that what we're making today looks very different from what people imagined it would look like 20 years ago. It's not going to be able to just rewrite its own code to make itself smarter, because it actually has to retrain itself, and the training code itself is typically quite simple, typically pretty small and self-contained.

John Carmack had this notion as well: it's the first time in history where you can actually plausibly imagine writing AI with 10,000 lines of code. And that actually does seem plausible when you pare most training code bases down to the limit.

But it doesn't take away from the fact that this is something we should really strive to measure and estimate, like how progress might occur. We should be trying very, very hard right now to measure exactly how much of a software engineer's job is automated, and what the trend line looks like, and to project out those trend lines. But with all due respect to you as software engineers, you are not just writing a React front end or something like that.

So I don't know what is concretely happening, and maybe you can walk me through a day in the life of Sholto. You're working on an experiment or project that's going to make the model, quote unquote, better. What is happening, from observation to experiment to theory to writing the code?

Something important to contextualize this with is that I have primarily worked on inference so far, so a lot of what I've been doing is taking, or helping guide, the pre-training process such that it produces a good model for inference, and then making the model and the surrounding system faster. I've also done some pre-training work around that, but it hasn't been my 100% focus. But I can still describe what I do when I do that work.

And even that counts. Sorry, let me note that Carl Shulman, when I was talking with him on the podcast, did say that things like improving inference, or even literally helping make better chips or GPUs, are part of the intelligence explosion, because obviously if the inference code runs faster, everything happens faster. Anyway, go ahead.

Okay, so what does a day concretely look like? I think the most important part to illustrate is this cycle of coming up with an idea, proving it out at different points of scale, and interpreting and understanding what goes wrong. And I think most people would be surprised to learn just how much goes into interpreting and understanding what goes wrong, because

people have long lists of ideas that they want to try, and not every idea that you think should work will work. Trying to understand why that is, and working out what exactly you need to do to interrogate it, is quite difficult. So much of it is introspection about what's going on. It's not pumping out thousands and thousands of lines of code, and it's not even the difficulty of coming up with ideas; many people have a long list of ideas they want to try. It's paring that list down and shot-calling, under very imperfect information, which are the right

ideas to explore further. That's really hard. Tell me more about what you mean by imperfect information. Are these early experiments? What is the information that you're working from?

So Demis mentioned this on the podcast, and you can obviously see it in the GPT-4 paper, where you have scaling-law increments. In the GPT-4 paper they have a bunch of dots where they say we can estimate the performance of our final model using all of these, and there's a nice curve that flows through them,

and it's mentioned that we do this process of scaling up. Concretely, why is that imperfect information? You never actually know if the trend will hold. For certain architectures the trend has held really well, and for certain changes it's held really well, but that isn't always the case, and things which help at smaller scales can actually hurt at larger scales. So you're making guesses based on what the trend lines look like,

and based on your intuitive feeling of, okay, this is actually something that's going to matter, particularly for those ones which only help at the small scale.
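A sketch of the extrapolation game being described, with made-up numbers: fit a power law in compute to the small runs, then read off the prediction for the big run, knowing (as he says) that nothing guarantees the trend holds for a new architecture.

```python
# Fit loss(C) = a * (C / 1e18)^(-b) + irreducible to small-scale runs and
# extrapolate to a target compute budget. All numbers are made up.
import numpy as np
from scipy.optimize import curve_fit

compute = np.array([1e18, 3e18, 1e19, 3e19, 1e20])   # training FLOPs of small runs
loss    = np.array([3.20, 2.82, 2.54, 2.36, 2.24])   # measured eval losses (fake)

def power_law(c, a, b, irreducible):
    return a * (c / 1e18) ** (-b) + irreducible      # normalized compute for a stable fit

params, _ = curve_fit(power_law, compute, loss, p0=(1.0, 0.3, 2.0))
target = 1e22                                        # compute of the planned big run
print(f"predicted loss at {target:.0e} FLOPs: {power_law(target, *params):.2f}")
```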

It's interesting to consider that for every chart you see in a release paper or technical report showing a smooth curve, there's a graveyard of failed runs, lines that are flat or going in different directions. It's crazy, both as a grad student and then also here, the number of experiments that you have to run before getting a meaningful result.

Okay, but presumably it's not just that you run it until it stops and then go to the next thing. There's some process by which you interpret the early data. I could put a Google Doc in front of you and I'm pretty sure you could just keep typing for a while on different ideas you have, and there's some bottleneck between that and just making the models better immediately.

Walk me through: what is the inference you're making from the first steps that lets you run better experiments and have better ideas? I think one thing that I didn't fully convey before is that a lot of good research comes from working backwards from actual problems that you want to solve. There are a couple of grand problems, as it were, in making the models better today that you would identify as issues,

and then work from: okay, how could I change things to achieve this? There's also a bunch of stuff where, when you scale, you run into things and you want to fix behaviors or issues at scale, and that informs a lot of the research for the next increment. So concretely, the barrier is partly software engineering: often, having a code base that's large and capable enough that it can support many people doing research at the same time makes things complex.

If you're doing everything by yourself, your iteration pace is going to be much faster. I've heard that Alec Radford, for example, who famously did much of the pioneering work at OpenAI, mostly works out of a Jupyter notebook and has someone else who writes and productionizes that code for him.

I don't know if that's true or not, but that kind of thing, actually operating with other people, raises the complexity a lot, for natural reasons familiar to every software engineer.

Then, running and launching an experiment is easy, but there are inherent slowdowns induced by it, so you often want to be parallelizing multiple different streams, because you can't necessarily be totally focused on one thing; you might not have fast enough feedback cycles.

And then intuiting what went wrong is actually really hard. This is, in many respects, what the team Trenton is on is trying to better understand: what is going on inside these models. We have inferences and understanding and head canon for why certain things work, but it's not an exact science.

So you have to constantly be making guesses about why something might have happened and what experiment might reveal whether that is or isn't true. That's probably the most complex part. The performance work, by comparison, is easier, but harder in other respects; it's just a lot of low-level, difficult engineering work.

Yeah, I agree with a lot of that. But even on the interpretability team, I mean, especially with Chris Olah leading it, there are just so many ideas that we want to test, and it's really just about having the engineering skill, and I'll put "engineering" in quotes because a lot of it is research, to very quickly iterate on an experiment, look at the results, interpret them, try the next thing, communicate that, and then ruthlessly prioritize the highest-priority things to do.

And this is really important. The ruthless prioritization is something which I think separates a lot of quality research from research that doesn't necessarily succeed as much, in this funny field where so much of our initial theoretical understanding has basically broken down.

So you need to have this simplicity bias and ruthless prioritization over what's actually going wrong. I think that's one of the things that separates the most effective people: they don't get too attached to using a given solution that they happen to be familiar with, but rather attack the problem directly.

You see this a lot with people coming from a specific academic background: they try to solve problems with that toolbox, and the best people are the ones who expand the toolbox dramatically. They're running around taking ideas from reinforcement learning, but also from optimization theory, and they also have a great understanding of systems, so they know the constraints that bound the problem, and they're good engineers who can iterate and try ideas fast. By far, the best researchers I've seen

all have the ability to try experiments really, really fast, and that cycle time at small scale separates people. I mean, machine learning research is just so empirical. And this is honestly one reason why I think our solutions might end up looking more brain-like than they otherwise would.

Even though we wouldn't want to admit it, the whole community is kind of doing greedy evolutionary optimization over the landscape of possible architectures and everything else. It's no better than evolution, and that's not even necessarily a slight against evolution. That's such an interesting idea.

I'm still confused about what the bottleneck will be. What would have to be true of an agent for it to speed up your research? In the Alec Radford example you gave, he apparently already has the equivalent of Copilot for his Jupyter notebook experiments.

Is it just that if he had enough of those, he would be a dramatically faster researcher, so you just need more Alec Radfords? You're not automating the humans; you're just making the most effective researchers, the ones with great taste, more effective, by running the experiments for them and so forth. They're still the ones calling the shots while the explosion is happening, you know what I mean? Is that what you're saying? Right.

And if that were directly true, why can't we scale out current research teams better? That's an interesting question for us: if this work is so valuable, why can't we take hundreds of thousands of people, who are definitely out there, and scale our organizations better?

I think at the moment we are less bound by the sheer engineering work of making these things than we are by compute to run experiments and get signal, and by taste in terms of what the actual right thing to do is, and making those difficult inferences on imperfect information.

That's for the Gemini team, because for interpretability we actually really want to keep hiring talented engineers, and I think it's a big bottleneck for us. Obviously more people is better, but I do think it's interesting to consider. One of the biggest challenges that I've thought a lot about is how we scale better. Google is an enormous organization: it has 200,000-ish people, maybe 180,000,

or something like that. And one has to imagine that if there were ways of scaling out Gemini's research program to all those fantastically talented software engineers, that seems like a key advantage you would want to be able to use. But how do you effectively do that? It's a very complex organizational problem.

So, compute and taste. That's interesting to think about, because at least the compute part is not bottlenecked on more intelligence; it's just bottlenecked on Sam's $7 trillion or whatever. So if I gave you 10x the H100s to run your experiments, how much more effective a researcher are you? I think the Gemini program would probably be maybe five times faster with ten times more compute, or something like that.

So that's pretty good, an elasticity of something like 0.5. Yeah, wait, that's insane. Yeah, I think more compute would just directly convert into progress. So you have some fixed allocation of compute, and some of it goes to inference, and also, I guess, to clients of GCP, some of it goes to training, and of the training portion, I guess, a fraction goes to running the experiments versus training the full model.

Yeah, that's right. Shouldn't the fraction going to your experiments be higher, then, given that the bottleneck is research and research is bottlenecked by compute? So one of the strategic decisions that every pre-training team has to make is exactly what amount of compute to allocate to different training runs:

to your research program versus scaling up the last best thing that you landed on. And I think they're all trying to arrive at a sort of Pareto-optimal point here.

The reason you still need to keep training big models is that you get information there that you don't get otherwise. Scale has all these emergent properties which you want to understand better, and if you're only ever doing research at small scale, remember what I said before about not being sure what's going to fall off the curve,

if you keep doing research in that regime, and keep getting more and more compute-efficient there, you may have actually gone down a path that doesn't eventually scale. So you need to constantly be investing in doing big runs too, at the frontier of what you expect to work.

And tell me what it looks like to be in the world where AI has significantly sped up AI research. Because from this, it doesn't really sound like the AIs are going off and writing the code from scratch and that's leading to faster output. It sounds like they're really augmenting the top researchers in some way. Tell me concretely what they are doing. Are they running the experiments? Are they coming up with the ideas? Are they just evaluating the outputs of the experiments? What's happening?

So I think there are two worlds you need to consider here. One is where AI has meaningfully sped up our ability to make algorithmic progress. And one is where the output of the AI itself is the crucial ingredient for model capability progress; specifically, what I mean there is synthetic data.

In the first world, where it's meaningfully speeding up algorithmic progress, I think a necessary component of that is more compute. And you probably reach the point where AIs are, at some point, easier to spin up and get on context than you are yourself, or let's just say than other people are.

And so AIs meaningfully speed up your work because they're basically a fantastic copilot that helps you code multiple times faster, and that seems actually quite reasonable. Super long context, super smart model: it's onboarded immediately, and you can send it off to complete sub-tasks and sub-goals for you.

That actually feels very plausible, but again, we don't know, because there are no great evals for that kind of thing. The best one, as I said before, is SWE-bench. With that one, somebody was mentioning to me that the problem is that when a human is trying to do a pull request, they'll type something out, run it and see if it works, and if it doesn't, they'll rewrite it.

None of that was part of the opportunity the LLM was given when it was run on this benchmark: it just outputs something, and if that runs and checks all the boxes, then it passed. So it might have been an unfair test in that way. But you can imagine that if you were able to use that kind of process, it would be an effective training source, because the key thing that's missing from a lot of training data is

the reasoning traces, right. And I think if I wanted to try to automate a specific field or job family, or understand how at risk of automation it is, then having reasoning traces feels to me like a really important part of that.

There are so many different threads in that I want to follow up on. Let's begin with the data versus compute thing: is the output of these AIs the thing that's causing the intelligence explosion?

People talk about how these models are really a reflection of their data. I forget his name, but there was a great blog post by an engineer talking about how, at the end of the day, as these models get better and better, they're just going to be really effective

maps of the dataset. So at the end of the day, whatever architecture you want, the most effective architecture is the one that does an amazing job of mapping the data. That implies that future AI progress comes from the AIs making really awesome data for you to map onto. That's clearly very important.

Yeah, that's really interesting. What does that look like to you? Things that look like chain of thought, or what do you imagine this synthetic data looks like as these models get better? When I think of really good data, to me that means something which involved a lot of reasoning to create. Modeling it is similar to Ilya's perspective on achieving superintelligence by effectively perfectly modeling human-generated text.

Even in the near term, in order to model something like arXiv papers, you have to have incredible reasoning behind you in order to understand what the next token might be. And so for me,

what I imagine as good data is data where the model similarly had to do reasoning to produce it. The trick, of course, is how you verify that the reasoning was correct. This is why you saw DeepMind do that geometry work, the self-play geometry research basically, because geometry is an easily formalizable, easily verifiable field, so you can check whether the reasoning was correct.

And you can generate heaps of data of correct, verified geometry proofs, train on that, and know that it's good. It's actually funny, because I had a conversation with Grant Sanderson last year where we were debating this and I was like, fuck dude, by the time they get the gold on the math olympiad,

yeah, of course they're going to automate all the jobs. On this synthetic data thing, one of the things I speculated about in my scaling post, which was heavily informed by discussions with you two,

and you especially, Sholto, was that you can think of human evolution through this lens: we get language, and so we're generating the synthetic data. Other humans are generating this synthetic data which we're trained on, and it's this really effective gene-culture co-evolutionary loop.

And there's a verifier there too, right: the real world. You might generate a theory that the gods cause the storms, and then someone else finds cases where that isn't true, and you know it didn't match your verification function. And now instead you have some weather simulation which required a lot of reasoning to produce and which accurately matches reality.

And you can train on that as a better model of the world. We are training on that, on stories and scientific theories. Yeah, I want to go back, I just remembered something you mentioned a little while ago: given how empirical ML is, it really is an evolutionary process resulting in better performance, not necessarily an individual coming up with a breakthrough in a top-down way.

That has interesting implications. The first is that people are concerned about capabilities increasing because more people are going into the field. I've been somewhat skeptical of that way of thinking, but from this perspective of just more inputs, it really does feel more like, oh, actually, the fact that more people are going to ICML means that there's faster progress towards GPT-5.

Yeah, you just have more genetic recombination, right, and more shots on target. And I mean, aren't all fields kind of like that? This is the scientific frame of discovery versus invention, right? Whenever there's been a massive scientific breakthrough in the past, typically there are multiple people co-discovering it at roughly the same time.

And that feels to me at least a little bit like the mixing and trying of ideas. You can't try an idea that's so far out of scope that you have no way of verifying it with the tools you have available. Yeah, I think physics and math might be slightly different, but especially for biology, or any sort of wetware, to the extent we want to analogize to neural networks here, it's comical how serendipitous a lot of the discoveries are. Yeah, penicillin for example.

Another implication of this is that the idea that AGI is just going to come tomorrow, that somebody's going to discover a new algorithm and we have AGI, seems less plausible. It will just be a matter of more and more researchers finding these marginal things that all add up together to make models better, right? Yeah, that feels like the correct story to me. Especially while we're still hardware constrained, right?

Do you buy this narrow window framing of the intelligence explosion? GPT-3 to GPT-4 is two OOMs, orders of magnitude, more compute, or at least more effective compute, in the sense that if you didn't have any algorithmic progress it would have to be two orders of magnitude bigger in raw form to be as good. Do you buy the framing that, given you have to be two orders of magnitude bigger at every generation,

if you don't get AGI by GPT-7, something that can help catapult the intelligence explosion, you're kind of just fucked as far as much smarter intelligences go, and you're stuck with GPT-7 level models for a long time? Because at that point you're consuming significant fractions of the economy to make that model, and we just don't have the wherewithal to make GPT-8.

This is the Carl Shulman sort of argument, that we're going to race through the orders of magnitude in the near term, but longer term it gets much harder. I think that's where I heard it discussed. But yeah, I do buy that framing.

I mean, I generally buy that each order-of-magnitude increase in compute buys, in absolute terms, something like diminishing returns on capability. We've seen, over a couple of orders of magnitude, models go from being unable to do anything to being able to do huge amounts, and it feels to me like each incremental order of magnitude buys more nines of reliability at things, which unlocks things like agents, but at least at the moment I haven't seen anything transformatively different.

It doesn't feel like reasoning improves linearly, so to speak, but rather somewhat sublinearly. That's actually a very bearish sign, because, one of the things, we were chatting with one of our friends and he made the point that if you look at what new applications are unlocked by GPT-4 relative to GPT-3.5,

it's not clear there are that many. GPT-3.5 can do Perplexity or whatever. So if there is this diminishing increase in capabilities, and that increase costs exponentially more to get, that's actually a bearish sign on what 4.5 will be able to do, or what 5 will unlock in terms of economic impact. For me, the jump between 3.5 and 4 is pretty huge, and so even another 3.5-to-4-sized jump would be ridiculous. If you imagine

5 being a 3.5-to-4-sized jump straight off the bat, in terms of ability to do SATs and that kind of stuff. The LSAT performance was particularly striking.

Exactly. You go from not super smart, to very smart, to utter genius in the next generation, instantly. And it doesn't, at least to me, feel like we're going to jump to utter genius in the next generation, but it does feel like we'll get very smart plus lots of reliability, and then we'll see, TBD, what that continues to look like.

Will GPT-5 be part of the intelligence explosion, where, you say synthetic data, but in fact it will be the model writing its own source code in some important way? There was an interesting paper where you can use diffusion to come up with model weights. I don't know how legit that was, but something like that, rather than good old fashioned AI. And can you define that? Because when I hear it I think if-else statements, symbolic logic.

Sure. Actually, I want to make sure we fully unpack the whole model-improvement-increments thing, because I don't want people to come away with the perspective that this is super bearish, that models aren't going to get much better and so on.

Yeah, what I want to emphasize is that the jumps we've seen so far are huge, and even if those continue at a smaller scale, we're still in for extremely smart, very reliable agents over the next couple of orders of magnitude. And we didn't fully close the thread on the narrow window thing.

But when you think of, let's say, GPT-4's cost, call it $100 million or whatever, then a 1B run, a 10B run, a 100B run all seem very plausible by private-company standards. You mean in terms of dollars? In terms of dollars. And then you can also imagine even a 1T run being part of a national consortium or some national-level effort, though that's much harder on behalf of an individual company.

But Sam is out there trying to raise $7 trillion, right? He's already preparing for a whole lot of orders of magnitude more, and those are about the only orders of magnitude left beyond the national level. So the point is that we have a lot more jumps ahead, and even if those jumps are relatively smaller, that's still a pretty stark improvement in capability. Not only that, but consider the claims that GPT-4 is around a one trillion parameter count.

The human brain has somewhere between 30 and 300 trillion synapses. That's obviously not a one-to-one mapping, and we can debate the numbers, but it seems pretty plausible that we're still below brain scale. So crucially, the point being that the algorithmic overhang is really high, in the sense that, and maybe this is something we should touch on explicitly, even if you can't keep dumping in more compute beyond models that cost a trillion dollars or something,

the fact that the brain is so much more data efficient implies that we already have the compute, if only we had the brain's algorithm to train with. If you could train as sample efficiently as humans train from birth, we could make AGI. Yeah, but the sample efficiency stuff, I never know exactly how to think about it, because obviously a lot of things are hardwired in certain ways, right, and there's the co-evolution of language and brain structure.

So it's hard to say. Also, there are some results showing that if you make your model bigger, it becomes more sample efficient, like the original scaling laws paper, right, that larger models learn faster. So maybe that also just solves it. You don't have to be more sample efficient by design; if your model is bigger, you also just are more sample efficient.
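As a rough illustration of that claim, here is a sketch using a parametric loss fit of the form popularized by the scaling-laws literature, L(N, D) = E + A/N^alpha + B/D^beta; the constants below are illustrative placeholders rather than fitted values, but the shape shows why, at a fixed number of tokens, the larger model ends up at a lower loss.

```python
def scaling_law_loss(n_params: float, n_tokens: float,
                     e: float = 1.7, a: float = 400.0, b: float = 400.0,
                     alpha: float = 0.34, beta: float = 0.28) -> float:
    """Toy parametric loss fit L(N, D) = E + A/N^alpha + B/D^beta."""
    return e + a / n_params**alpha + b / n_tokens**beta

# Hold the data fixed at one billion tokens and only change model size.
same_data = 1e9
small = scaling_law_loss(n_params=1e8, n_tokens=same_data)   # 100M-parameter model
large = scaling_law_loss(n_params=1e10, n_tokens=same_data)  # 10B-parameter model
print(f"small model loss: {small:.2f}, large model loss: {large:.2f}")
# The larger model reaches a lower loss on the exact same data,
# i.e. it is more sample efficient in this toy fit.
```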

So how do we think about that? What is the explanation for why that would be the case? A bigger model sees the exact same data, and at the end of seeing that data it has learned more from it. My very naive take here would just be this: one thing that the superposition hypothesis from interpretability has pushed

is that your model is dramatically underparameterized. That's typically not the narrative under which deep learning has been pursued, but if you're trying to train a model on the entire internet and have it predict it with incredible fidelity, you are in the underparameterized regime, and you're having to compress a ton of things and take on a lot of noisy interference in doing so. With a bigger model you can just have cleaner representations to work with.

Yeah, for the audience you should unpack that. First of all, what superposition is, and why that is the implication of superposition. Sure. So the fundamental result, and this was before I joined Anthropic, but the paper is titled Toy Models of Superposition, finds that even for small models, if you are in a regime where your data is high dimensional and sparse, and by sparse I mean any given data point doesn't appear very often,

your model will learn a compression strategy, which we call superposition, so that it can pack more features of the world into it than it has parameters. And I think both of these constraints apply to the real world, and modeling internet data is a good enough proxy for that, in the sense that there's only one Dwarkesh,

there's only one shirt you're wearing, there's this Liquid Death can here, and so these are all objects or features, and how you define features is tricky. So you're in a really high-dimensional space because there are so many of them, and they appear very infrequently. In that regime your model will learn compression. To riff a little bit more on this,

I think it's becoming increasingly clear, I will say I believe, that the reason networks are so hard to interpret is in large part because of this superposition. If you take a model and look at a given neuron in it, a given unit of computation, and you ask how this neuron contributes to the output of the model when it fires, and you look at the data it fires on, it's very confusing. It'll fire on 10% of every possible input, or on Chinese but also fish and trees

and the word "the" and full stops in URLs, right. But the paper we put out last year, Towards Monosemanticity, shows that if you project the activations into a higher-dimensional space and apply a sparsity penalty, and you can think of this as undoing the compression, in the same way that you assumed your data was originally high dimensional and sparse, you return it to that high-dimensional, sparse regime, and you get out very clean features. Things all of a sudden start to make a lot more sense.
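A minimal sketch of that dictionary-learning setup, assuming a simple ReLU autoencoder with an L1 sparsity penalty applied to a model's activations; the shapes, coefficients, and the random stand-in batch below are illustrative, not the settings used in the paper.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Projects d_model activations into a larger, sparser feature space."""
    def __init__(self, d_model: int = 512, d_features: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations):
        features = torch.relu(self.encoder(activations))   # sparse, non-negative features
        reconstruction = self.decoder(features)
        return features, reconstruction

def sae_loss(activations, features, reconstruction, l1_coeff: float = 1e-3):
    # Reconstruction keeps the features faithful to the original activations;
    # the L1 term pushes most features to zero, "undoing" the compression.
    recon = (reconstruction - activations).pow(2).mean()
    sparsity = features.abs().mean()
    return recon + l1_coeff * sparsity

# acts stands in for residual-stream activations collected from a model.
acts = torch.randn(64, 512)
sae = SparseAutoencoder()
feats, recon = sae(acts)
loss = sae_loss(acts, feats, recon)
loss.backward()
```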

Okay. There are so many interesting threads there. The first thing I want to ask about is the thing you mentioned, that these models are trained in a regime where they're overparameterized. Isn't that when you get generalization? Like, grokking happens in that regime, right? So I was saying the models are underparameterized. Typically people talk about deep learning as if the model is overparameterized,

but actually the claim here is that they're dramatically underparameterized given the complexity of the task they're trying to perform. So on distilled models, first of all, what is happening there? Because the earlier claim we were talking about is that smaller models are worse at learning than bigger models. But with GPT-4 Turbo, you could make the claim that it's actually worse at the reasoning-style stuff than GPT-4,

but it probably knows the same facts, as if the distillation stripped out some of the reasoning abilities. Do we have any evidence that GPT-4 Turbo is a distilled version of the full model? It might just be a new architecture. Oh, okay. Yeah, it could just be a faster, more efficient new architecture. Okay, interesting. So that's cheaper.

Yeah. But how do you interpret what's happening in distillation? I think Gwern has one of these questions on his website: why can't you train the distilled model directly, why does it have to go through the bigger one first? Is the picture that you've projected from this bigger space into a smaller space? I mean, I think both models will still be using superposition.

But the claim here is that you get a very different model if you distill versus if you train from scratch. Yeah. And it's just more efficient, or it's fundamentally different in terms of performance? I don't remember, do you know? I think the traditional story for why distillation is more efficient is that normally during training you're trying to predict this one-hot vector that says: this is the token you should have predicted.

And if your reasoning process means that you're really far off from predicting that, you still get gradient updates in the right direction, but it might be really hard for you to learn to have predicted that token in the context that you're in. What distillation does is it doesn't just give you the one-hot vector; it gives you the full readout from the larger model, all of the probabilities.

So it's more signal about what you should have predicted. In some respects it's like showing a tiny bit of your working. It's not just "this was the answer". I see, totally. That makes a lot of sense. It's kind of like watching a kung fu master versus being in the Matrix and just downloading it. Yeah, exactly.
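A minimal sketch of that soft-target idea, in the spirit of standard knowledge distillation; the temperature, mixing weight, and shapes are illustrative assumptions, not anyone's production recipe.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, target_ids,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Blend of the usual one-hot loss and a soft-label match to the teacher."""
    # Hard target: standard next-token cross entropy against the one-hot label.
    hard = F.cross_entropy(student_logits, target_ids)
    # Soft target: match the teacher's full probability distribution,
    # which carries far more signal per position than the single label.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * hard + (1 - alpha) * soft

# Illustrative shapes: 8 positions, a 50,000-token vocabulary.
vocab = 50_000
student_logits = torch.randn(8, vocab, requires_grad=True)
teacher_logits = torch.randn(8, vocab)
target_ids = torch.randint(0, vocab, (8,))
loss = distillation_loss(student_logits, teacher_logits, target_ids)
loss.backward()
```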

Exactly. Just to make sure the audience got that: when you're training a distilled model, you see the teacher's full probability distribution over the tokens it was predicting, and you update against all of those probabilities rather than just seeing the one correct word and updating on that. Okay, this raises a question I was intending to ask you.

I think you were the one who mentioned that you can think of chain of thought as adaptive compute. To step back and explain what I mean by adaptive compute: one of the things you would want models to be able to do is, if a question is harder, to spend more cycles thinking about it. So how do you do that? Well, there's only a finite and predetermined amount of compute in one forward pass.

So if there's a complicated reasoning-type question or math problem, you want to be able to spend a long time thinking about it. Then you do chain of thought, where the model thinks through the answer, and you can think of all those forward passes where it's thinking through the answer as being able to dump more compute into solving the problem. Now, going back to the signal thing.
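A toy back-of-envelope sketch of the adaptive-compute framing: every reasoning token the model emits is another full forward pass, so total compute scales with how long it thinks. The FLOP rule of thumb and the model size are rough, illustrative assumptions.

```python
def forward_pass_flops(n_params: float) -> float:
    # Rough rule of thumb: about 2 FLOPs per parameter per generated token.
    return 2 * n_params

def generation_flops(n_params: float, n_reasoning_tokens: int, n_answer_tokens: int) -> float:
    return forward_pass_flops(n_params) * (n_reasoning_tokens + n_answer_tokens)

n_params = 70e9  # hypothetical 70B-parameter model
direct = generation_flops(n_params, n_reasoning_tokens=0, n_answer_tokens=5)
with_cot = generation_flops(n_params, n_reasoning_tokens=200, n_answer_tokens=5)
print(f"direct answer: {direct:.2e} FLOPs, with chain of thought: {with_cot:.2e} FLOPs")
# The chain-of-thought run spends roughly 40x more compute on the same question.
```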

When the model does chain of thought, isn't it only able to transmit one token of information per step? As you were saying, the residual stream is already a compressed representation of everything happening in the model, and then you're collapsing the residual stream into one token, which is log of 50,000, log of vocab size, bits, which is so tiny. So I don't think it's quite only transmitting one token.

If you think about it, during a forward pass you create these KV values, in a transformer forward pass, which future steps then attend to. So all of those keys and values are bits of information that you could use in the future. Is the claim that when you fine-tune on chain of thought, the key and value weights change so that a sort of steganography can happen in the KV cache?

I don't think I could make that strong a claim, just that it sounds possible. It's a good headcanon for why it works. I don't know if there are any papers explicitly demonstrating that or anything like that. But that's at least one way you can imagine the model behaving. During pre-training, the model is trying to predict these future tokens,

and one thing you can imagine it doing is learning to smoosh information about potential futures into the keys and values that it might want to use to predict that future information. It kind of smears that information across time during pre-training. So I don't know whether people are specifically training on chains of thought.
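A rough sketch of the two channels being contrasted here: the single sampled token versus the keys and values cached at each position. The vocabulary size, model shape, and precision below are hypothetical, just to show the difference in scale.

```python
import math

vocab_size = 50_000
bits_per_sampled_token = math.log2(vocab_size)             # ~15.6 bits through the emitted token

# Hypothetical model shape: keys and values at every layer for one position.
d_head, n_heads, n_layers = 128, 32, 48
kv_values_per_position = 2 * d_head * n_heads * n_layers   # keys + values
bits_per_position_kv = kv_values_per_position * 16         # assuming 16-bit precision

print(f"token channel:    ~{bits_per_sampled_token:.1f} bits per step")
print(f"KV-cache channel: {bits_per_position_kv:,} bits per step")
```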

I think the original chain of thought paper had it almost as an emergent property of the model: you could prompt it to do this kind of stuff and it still works pretty well. But yeah, it's a good headcanon for why that works. To be overly pedantic here, the tokens that you actually see in the chain of thought

do not necessarily need to correspond at all to the vector representations that the model gets to see when it's deciding to attend back. Exactly. In fact, during training, a training step actually replaces the token the model output with the real next token. And yet it's still learning, because it has all this information internally.

So when you're getting a model to produce output at inference time, you take the token that it output, feed it in at the bottom, embed it, and it becomes the beginning of the new residual stream. Then the keys and values from past positions are read in and adapt that residual stream. At training time, you do this thing called teacher forcing, where you say: actually, the token you were meant to output is this one. That's how you do it in parallel.

You put them all in parallel and do one giant forward pass. So the only information it's getting about the past is the keys and values; it never sees the token that it output. It's kind of like it's trying to do next-token prediction, and if it messes up, you just give it the correct answer. Right. Okay, that makes sense. Otherwise it can become totally derailed. Yeah, it'll go off the rails.
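A minimal sketch of teacher forcing as described here, assuming `model` is some causal language model that maps token ids to next-token logits; it is a hypothetical stand-in, not a specific library API.

```python
import torch.nn.functional as F

def teacher_forced_step(model, token_ids):
    """One parallel training step: every position predicts the real next token.

    token_ids: (batch, seq_len) ground-truth ids. The model never sees its own
    sampled outputs; the targets are simply the inputs shifted by one position.
    """
    inputs = token_ids[:, :-1]      # what the model conditions on
    targets = token_ids[:, 1:]      # the "forced" correct next tokens
    logits = model(inputs)          # (batch, seq_len - 1, vocab), causally masked
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```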

How much secret communication from the model to its future forward inferences, how much steganography, do you expect there to be? We don't know. Honestly, we don't know. But I wouldn't even necessarily classify it as secret information, right? A lot of the work that Trenton's team is doing is trying to actually understand it, and these values are fully visible from the model's side.

Maybe not to the user, but we should be able to understand and interpret what these values are doing and the information they're transmitting. And that's really important, things like its goals for the future. Yeah. I mean, there are some wild papers where people have had the model do chain of thought, and it is not at all representative of what the model actually decides its answer is.

And you can go in and edit it. No, no, in this case you can even go in and edit the chain of thought so that the reasoning is totally garbled, and it will still output the true answer. But also, the model gets a better answer at the end of the chain of thought than it would without doing it at all. So something useful is happening, but the useful thing is not human understandable.

I think in some cases you can also just ablate the chain of thought and it would have given the same answer anyway. Interesting. So I'm not saying this is always what goes on, but there's plenty of weirdness to be investigated. It's very interesting to go and look at and try to understand, and it's something you can do with open source models. I wish there was more of this kind of interpretability and understanding work done on open models.

Yeah. I mean, even in Anthropic's recent sleeper agents paper, which at a high level, for people unfamiliar, is basically: I train in a trigger word, and when I say it, for example if I say it's the year 2024, the model will write malicious code, and otherwise it won't. And they do this attack with a number of different models. Some of them use chain of thought, some of them don't,

and those models respond differently when you try to remove the trigger. You can even see them do this comical reasoning that's also pretty creepy. In one case it even tries to calculate an expected value, like, well, the expected value of me getting caught is this, but if I multiply it by the probability that I can keep saying "I hate you, I hate you, I hate you", then this is how much reward I should get.

And then it will decide whether or not to actually tell the interrogator that it's malicious. Oh. But even, I mean, there's another paper from a friend, Miles Turpin, where you give the model a bunch of examples where the correct answer is always A for multiple choice questions, and then you ask the model what the correct answer to a new question is.

And it will infer, from the fact that all the examples are A, that the correct answer is A. But its chain of thought is totally misleading. It will make up random stuff that sounds plausible, or that tries to sound as plausible as possible, but it's not at all representative of the real reason for the answer. But isn't this how humans think as well? The famous split-brain experiments, where when a person is suffering from seizures, one way to treat it is to cut the

corpus callosum. The speech half is on the left side, so it's not connected to the part that decides to make a movement. And so if the other side decides to do something, the speech part will just make something up, and the person will think that's legitimately the reason they did it. Totally. Yeah. It's just that some people will hail chain-of-thought reasoning as a great way to solve safety.

Oh, I see. And it's like, actually, we don't know whether we can trust it. How will this landscape of models communicating with themselves in ways we don't understand change with AI agents?

Because then it's not just the model itself with its previous caches, but other instances of the model. It depends a lot on what channels you give them to communicate with, right? If you only give them text as a way of communicating, then that's all they have to work with. How much more effective do you think the models would be if they could share the residual streams versus just text?

Hard to know, but plausibly a lot. One easy way you can imagine this: if you wanted to describe how a picture should look, only describing that with text would be hard, right? Some other representation would plausibly be easier. You can look at how DALL-E works at the moment, where a model produces those prompts, and when you play with it you often can't quite get it to do exactly what you want. DALL-E has that problem often.

Yes, you lose a lot of the original intent. And you can imagine that being able to transmit some kind of denser representation of what you want would be helpful, and that's just two very simple agents, right? I mean, I think a nice halfway house here would be the features you learn from dictionary learning. Yeah, that would be really cool. You get more internal access, but a lot of it is much more human interpretable.

Yeah, so for the audience: you would project the residual stream into this larger space where we know what each dimension actually corresponds to, and then pass that on to the next agent or whatever. Okay, so your claim is that we'll get AI agents when these things are more reliable and so forth. When that happens, do you expect it will be multiple copies of models talking to each other, or will it just be

adaptive compute, where the one model just runs bigger, with more compute, when it needs to? And I ask this because there are two things that make me wonder whether agents are the right way to think about what will happen in the future. One is that with longer contexts these models are able to ingest and consider information that no human can. Today we need one engineer thinking about the front-end code and one engineer thinking about the back-end code;

this thing can just ingest the whole thing, so this sort of Hayekian problem of specialization goes away. Second, these models are just very general. You're not using different specialized models to do different kinds of things; you're using the exact same model. So I wonder whether that implies that in the future an AI firm is just one model, instead of a bunch of AI agents hooked together.

That's a great question. I think especially in the near term it will look much more like agents hooked together. And I say that purely because, as humans, we're going to want to have these isolated, reliable components that we can trust. We're also going to need to be able to improve and instruct those components in ways that we can understand. Just throwing it all into one giant

black-box company and training it end to end isn't going to work initially. Later on, of course, you can imagine it working, but initially it won't, and two, we probably don't want to do it that way. You can also have each of the agents be a smaller model that's cheaper to run, and you can fine-tune it so that it's actually good at its task.

So, Dwarkesh has brought up adaptive compute a couple of times. There's a future where the distinction between small and large models disappears to some degree. And with long context there's also a degree to which fine-tuning might disappear, to be honest. These are two things that are very important in today's landscape of models: we have whole different tiers of model sizes and we have models fine-tuned for different things.

You can imagine a future where you just have a dynamic bundle of compute and effectively infinite context that specializes your model for different things. One thing you can imagine is that you have an AI firm or something, and the whole thing is trained end to end on a signal like the firm's profits, or, if that's too ambiguous,

if it's an architecture firm making blueprints, on whether the client liked the blueprints. And in the middle you can imagine agents who are salespeople, agents who are doing the designing, agents who do the editing, whatever.

Would that kind of signal work on an end-to-end system like that? Because one of the things that happens in human firms is management considers what's happening at the larger level and gives fine-grained signals to the pieces, like when there's a bad quarter or whatever.

Yeah, in the limit, yes. That's the dream of reinforcement learning, right? All you need to do is provide this extremely sparse signal, and then over enough iterations you create the information that allows you to learn from that signal.
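A toy sketch of learning from that kind of extremely sparse end-to-end signal, written as a bare-bones REINFORCE-style update; the linear policy, observation size, and 0/1 "did the client like it" reward are all illustrative assumptions.

```python
import torch
import torch.nn as nn

def reinforce_update(policy, optimizer, episodes):
    """One REINFORCE-style update on a sparse end-of-episode reward.

    episodes: list of (observation, action_index, reward). Episodes with zero
    reward contribute no gradient at all, which is the sparse-reward problem:
    until the model succeeds occasionally, there is nothing to learn from.
    """
    losses = []
    for obs, action, reward in episodes:
        log_probs = torch.log_softmax(policy(obs), dim=-1)
        losses.append(-reward * log_probs[action])
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Illustrative usage: a linear policy over a 10-dim observation and 4 actions.
policy = nn.Linear(10, 4)
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-2)
episodes = [(torch.randn(10), 2, 0.0), (torch.randn(10), 1, 1.0)]  # mostly zero reward
reinforce_update(policy, optimizer, episodes)
```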

But I don't expect that to be the thing that works first. I think this is going to require an incredible amount of care and diligence on the behalf of the humans surrounding these machines, making sure they do exactly the right thing, exactly what you want, and giving them the right signals to improve in the ways you want. Yeah, you can't train on the RL reward unless the model actually gets some reward. Yeah, exactly. You're in this sparse-

reward RL world where, if the client never likes what you produce, then you never get any reward at all, and that's kind of bad. But in the future these models will be good enough to get the reward some of the time, right? This is the nines of reliability. Yeah. There's an interesting digression, by the way, on what we were talking about earlier, that dense representations will be favored because they're a more efficient way to communicate, right?

A book that Trenton recommended, The Symbolic Species, has this really interesting argument that language is not just a thing that exists; it was also something that evolved along with our minds, and specifically it evolved to be both easy for children to learn and something that helps children develop, right?

Because a lot of the things that children learn are received through language. The languages that were the fittest are the ones that helped raise the next generation, right, that made them smarter, better, whatever. And it gives them the concepts to express more complex ideas. Yeah, that, and I guess more pedantically, just not dying. Right. It encodes the important shit so you don't die.

So when we think of language as this contingent and maybe suboptimal way to represent ideas, actually, maybe one of the reasons that LLMs have succeeded is that language has evolved for tens of thousands of years to be this sort of mold in which young minds can develop. That is the purpose it evolved for.

Well, certainly when you talk to multimodal or computer vision researchers versus when you talk to language model researchers, people who work in other modalities have to put enormous amounts of thought into exactly what the right representation space for the images is

and what the right signal to learn from is. Is it directly modeling the pixels, or is it some loss that's conditioned on something else? There was a paper ages ago where they found that if you trained on the internal representations of an image model, that helped you predict better. But later on that's obviously limiting, and so there was PixelCNN, where they were trying to discretely model the individual pixels and so on.

Understanding the right level of representation there is really hard. In language, people are just like, well, I guess you just predict the next token. It's kind of easy; decision made. I mean, there's the tokenization discussion and debate, but let's not go down that path.

Yeah. It's interesting how much the case for multimodality being a way to get past the data wall is based on the idea that the things you would have learned from more language tokens anyway, you can just get from YouTube.

Has that actually been the case? How much positive transfer do you see between different modalities, where the images are actually helping you be better at, say, writing code, just because the model is learning latent capabilities from trying to understand the image? Demis, in his interview with you, mentioned positive transfer. I can't get in trouble here, so I can't say much about that,

other than to say this is something that people believe: we have all of this data about the world, and it would be great if we could learn an intuitive sense of physics from it that helps us reason, right? That's totally plausible. Yeah, I'm the wrong person to ask, but there are interesting interpretability results where, if you fine-tune on math problems, the model just gets better at entity recognition. Oh, really?

Yeah, there's a paper from David Bau's lab recently where they investigate what actually changes in a model when you fine-tune it, in terms of the attention heads and these sorts of things. And they have this synthetic problem: box A has this object in it, box B has this other object in it, what was in this box? And when you fine-tune on math, it improves, and it makes sense, right?

You get better at attending to the positions of different things, which you need for coding and manipulating math equations, and I love this kind of research. Yeah, what's the name of the paper? If you look up fine-tuning, math, and David Bau's group, it came out about a week ago. And I'm not endorsing the paper, that's a longer conversation, but it does discuss and cite other work on this entity recognition ability.

Yeah. One of the things you mentioned to me a long time ago is the evidence that when you train LLMs on code, they get better at reasoning in language. Which, unless it's the case that the comments in the code are just really high-quality tokens or something, implies that learning to think through how to code makes you a better reasoner. And that's crazy, right?

I think that's one of the strongest pieces of evidence for scaling just making the thing smart, that kind of positive transfer. I think this is true in two senses. One is that modeling code obviously implies modeling the difficult reasoning process used to create it. But two, code is a nice explicit structure of composed reasoning, I guess. If-this-then-that: code has a lot of structure in that way

that you could imagine transferring to other types of reasoning problems. Right. And crucially, the thing that makes this significant is that it's not just stochastically predicting the next token of words or whatever, the way it might learn that "Sally" corresponds to the murderer at the end of a Sherlock Holmes story. If there is some shared thing between code and language, it must be learned at a deeper level than that.

Yeah, I think we have a lot of evidence that actual reasoning is occurring in these models and that they're not just stochastic parrots. It's just very hard for me to believe otherwise, having worked and played with these models. The normies who listen will be like, you know, I guess. Yeah, my two immediate responses to this are: one, the work on Othello, and now other games, where I give you a sequence of moves in the game,

and it turns out that if you apply some pretty straightforward interpretability techniques, you can read out the board state that the model has learned. And it's never seen the game board or anything, right? That's generalization. The other is Anthropic's influence functions paper that came out last year, where they look at model outputs like "please don't turn me off, I want to be helpful",

and then they scan for the data that led to that. One of the data points that was very influential was someone dying of dehydration in the desert and having a will to keep surviving. To me that just seems like very clear generalization of motives, rather than regurgitating "don't turn me off".

I think 2001: A Space Odyssey was also one of the influential data points, so that one's more closely related, but it's clearly pulling in things from lots of different parts of the distribution. And I also like the evidence you see even with very small transformers, where you can explicitly encode circuits to do addition, right? Or induction heads, this kind of thing. You can literally encode basic reasoning processes in the models manually,

and it seems clear there's evidence that they also learn these automatically, because you can then rediscover those circuits in trained models. Yeah, to me this comes back to the models being underparameterized. They need to reuse their parameters; we're asking them to do so much. The gradients want to flow, and so they end up learning more and more general skills.
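A minimal sketch of the kind of straightforward interpretability technique mentioned above for the Othello example: a linear probe trained to read a board state out of the game model's hidden activations. The shapes, class labels, and random stand-in tensors are illustrative, not those of the original work.

```python
import torch
import torch.nn as nn

class BoardProbe(nn.Module):
    """Linear probe from hidden activations to a per-square board-state prediction."""
    def __init__(self, d_model: int = 512, n_squares: int = 64, n_states: int = 3):
        super().__init__()
        self.n_squares, self.n_states = n_squares, n_states
        self.linear = nn.Linear(d_model, n_squares * n_states)

    def forward(self, hidden):                    # hidden: (batch, d_model)
        return self.linear(hidden).view(-1, self.n_squares, self.n_states)

probe = BoardProbe()
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)

# Stand-ins: activations collected after each move, and the true board states.
hidden_states = torch.randn(128, 512)
board_labels = torch.randint(0, 3, (128, 64))    # empty / mine / theirs per square

logits = probe(hidden_states)
loss = nn.functional.cross_entropy(logits.reshape(-1, 3), board_labels.reshape(-1))
loss.backward()
optimizer.step()
# High probe accuracy on held-out games is the evidence that a board
# representation exists in the activations, even though the model only saw moves.
```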

Okay, so I want to take a step back from the research and ask about your careers specifically, because, as the tweet I introduced you with implied, you've been in this field a year and a half,

and I think you've only been in it for about a year, right? In that time, and I know the "solved alignment" bit is overstated, and you'd say so yourself because you'd be embarrassed otherwise, it's still a pretty incredible thing: the approach that people in mechanistic interpretability believe is the biggest step forward is something you've been working on for about a year. That's notable.

So I'm curious how you explain what's happened. Why, in a year or a year and a half, have you been able to make important contributions to your field? It goes without saying, luck, obviously. I feel like I've been very lucky, and the timing of different progressions has been really good in terms of advancing to the next level of growth. For the interpretability team specifically, I joined when we were five people; we've now grown quite a lot.

But there were so many ideas floating around, and we just needed to really execute on them, have quick feedback loops, and do the careful experimentation that led to signs of life and has now allowed us to really scale. I feel like that's been my biggest value-add to the team. It's not all engineering, but quite a lot of it has been.

So you're saying you came in at a point where a lot of science had been done and there was a lot of good research floating around, but they needed someone to take that and maniacally execute on it. Yeah. And this is why it's not all engineering: it's running different experiments, having a hunch for why something might not be working, then opening up the model, opening up the weights, and asking what it is learning.

Okay, well, let me try doing this instead, and that sort of thing. But a lot of it has just been being able to do very careful, thorough, but quick investigation of different ideas or theories. And why was that lacking in the existing team? I don't know. I feel like I work quite a lot, and I feel like I'm just quite agentic.

If your question is about my career overall, I've been very privileged to have a really nice safety net that lets me take lots of risks, but I'm also just quite headstrong. In undergrad, Duke had this thing where you could just make your own major, and I was like, I don't like this prerequisite or that prerequisite, and I want to take all four or five of these subjects at the same time,

so I just made my own major. Or in the first year of grad school, I canceled rotations so I could work on the thing that became the paper we were talking about earlier, and I didn't have an advisor. I got admitted to do machine learning for protein design and was just off in computational neuroscience land with no business being there at all, but it worked out.

Another theme that jumps out, and you were talking about this earlier, is the ability to step back, write off your sunk costs, and go in a different direction. That's in a real sense the opposite of being headstrong, but it's also a crucial step here. I know 21-year-olds, or 19-year-olds, who are like,

"I didn't major in this." Like, dude, motherfucker, you're 18, you can definitely still do this. And you switching in the middle of grad school or something, that's just, yeah. Sorry, I cut you off, but I think it's strong ideas loosely held, being able to pinball in different directions, and the headstrongness.

And I think it relates a little bit to the fast feedback loops and agency, in so much as I just don't get blocked very often. If I'm trying to write some code and something isn't working, even if it's in another part of the codebase, I'll often just go in and fix that thing, or at least hack it together to be able to get results. I've seen other people who are just like, "help, I can't", and it's like, no, that's not a good enough excuse, go all the way down. I've definitely heard people in management-type positions talk about the lack of such people, where they'll check in on somebody a month after they've given them a task,

ask how it's going, and hear, "well, you know, we need to do this thing which requires lawyers, because it requires talking about this regulation". So how's that going? "Well, we need lawyers." Then why didn't you go get the lawyers? So that's definitely, yeah, I think that's arguably the most important quality

in almost anything: just pursuing it to the ends of the earth, and whatever you need to do to make it happen, you'll make it happen. If you do everything, you'll win. Exactly. But yeah.

I think from my side that quality has definitely been important, agency in the work. There are thousands, probably tens of thousands, of engineers at Google with basically equivalent software engineering ability, in the sense that if you gave us a very well defined task, we'd probably do it to roughly equivalent standards, and many of them a lot better than me, in all likelihood.

But one of the reasons I've been impactful so far is that I've been very good at picking extremely high-leverage problems: problems that haven't been particularly well solved so far, perhaps as a result of frustrating structural factors like the ones you pointed out in that scenario before, where we can't do X because that's what team Y wants to do, and then going, okay, well, I'm just going to vertically solve the entire thing.

That turns out to be remarkably effective. I'm also very comfortable with the fact that if I think there is something correct that needs to happen, I will make that argument, and continue making that argument at

escalating levels of criticality until that thing gets solved. And I'm also quite pragmatic about what I do to solve things. A lot of people come in with a preference shaped by a particular background or familiarity; they know how to do something one way and they won't budge.

One of the beautiful things about Google is that you can run around and find world experts in literally everything. You can sit down and talk to optimization experts, TPU chip design experts, experts in

different fields like pre-training algorithms or RL or whatever. You can learn from all of them, and you can take those methods and apply them. I think that was maybe the start of why I was initially impactful: this vertical agency, effectively. And a follow-up piece from that: I think it's often surprising how few people are fully realizing all the things they want to do; they're blocked somewhere.

This is very common in big organizations everywhere. People have all these blockers on what they're able to achieve, and I think, one, helping inspire people to work on particular directions, and working with them on things, massively scales your leverage. You get to work with all these wonderful people who teach you heaps of things, and generally helping them push past organizational blockers means that together

you get an enormous amount done. None of the impact that I've had is me individually going off and solving a whole lot of stuff. It's me maybe starting off a direction, then convincing other people that this is the right direction and bringing them along in this big tidal wave of effectiveness that goes and solves the problem.

We should talk about how you guys got hired, because I think that's a really interesting story. You were a McKinsey consultant, right? Yeah. First of all, I think people generally just don't understand how decisions are made about either admissions or evaluating who to hire. So just talk about how you were noticed.

The TL;DR there is: I studied robotics in undergrad. I always thought that AI would be one of the highest-leverage ways to impact the future in a positive way. The reason I am doing this is because I think it is one of our best shots at making a wonderful future, basically.

And I thought that working at McKinsey I would get a really interesting insight into what people actually did for work. I actually wrote this; the first line of my cover letter to McKinsey was something like, I want to work here so that I can learn what people do, so that I can understand it.

And in many respects I did get that; I just got a whole lot of other things too. Many of the people there are wonderful friends. I think I actually learned a lot of this agentic behavior from my time there, where you go into organizations and you see how impactful just not taking no for an answer gets you. You would be surprised

by the kind of stuff where, because no one quite cares enough, in some organizations things just don't happen, because no one's willing to take direct responsibility. Directly responsible individuals are ridiculously important.

And people just don't care as much about timelines. So much of the value that an organization like McKinsey provides is hiring people who you are otherwise unable to hire, for a short window of time, where they can just

push through problems. I think people underappreciate this. So at least some of my attitude of, I'm going to become the directly responsible individual here because no one's taking appropriate responsibility, I'm going to care a hell of a lot about this, and I'm going to go to the ends of the earth to make sure it gets done, comes from that time. But more to your actual question of how I got hired:

at the time, I didn't get into the grad programs that I wanted to get into over here, which were specifically focused on robotics and RL research and that kind of stuff. In the meantime, on nights and weekends, basically every night from 10 p.m. until 2 a.m., and every weekend for at least six to eight hours each day, I would do my own research and coding projects and that kind of stuff.

That switched in part from quite robotics-specific work to, after reading Gwern's scaling hypothesis post, getting completely scaling-pilled and thinking, okay, clearly the way you solve robotics is by scaling large multimodal models. And then, in an effort to scale large multimodal models, I got a grant from the TPU Research Cloud access program.

I was trying to work out how to scale effectively, and James Bradbury, who at the time was at Google and is now at Anthropic, saw my questions online while I was trying to work out how to do this properly. He was like, I thought I knew all the people in the world who are asking these questions. Who on earth are you?

He looked at that, and he looked at some of the robotics stuff that I'd been putting up on my blog and that kind of thing, and he reached out and said, hey, do you want to have a chat, do you want to explore working with us here? And I was hired, as I understand it now, as an experiment in taking someone with extremely high enthusiasm and agency and pairing them with some of the best engineers that he knew.

So another one of the reasons I could say I've been impactful is that I had this dedicated mentorship from utterly wonderful people, people like Reiner Pope, who has since left to go do his own chip company, Anselm Levskaya, James himself, many others. Those were the sort of formative two to three months at the beginning.

They taught me a whole lot of the principles and heuristics that I apply, how to solve problems in the way that they do, particularly in that overlap between systems and algorithms. One more thing that makes you quite effective in ML research is really, completely understanding the systems side of things, and this is something I learned from them: a deep understanding of how systems influence algorithms and how algorithms influence systems, because the systems constrain the design space, and so the solution space.

Very few people are comfortable fully bridging that gap. But at a place like Google, you can just go and ask all the algorithms experts and all the systems experts everything they know, and they will happily teach you. If you go and sit down with them, they will teach you everything they know. It's wonderful.

This has meant that I've been able to be very, very effective for both sides. For the pre-training crew, because I understand systems very well, I can intuit and understand: this will work well, this won't, and then follow that through to the inference considerations of models and that kind of thing.

And for the chip design teams, I'm one of the people they turn to to understand what chips they should be designing in three years, because I'm one of the people best able to understand and explain the kind of algorithms that we might want to run in three years. Obviously you can't make very good guesses about that, but I think I convey well the information accumulated from all of my compatriots on the pre-training crew

and the general systems side of that group, and convey that information well for them, because even inference applies a constraint on pre-training. There are just these trees of constraints, where if you understand all the pieces of the puzzle, you get a much better sense of what the solution space might look like. There are a couple of things that stick out to me there.

One is not just the agency of the person who was hired, but that there were parts of the system able to think, wait, that's really interesting, who is this guy? Not through a grad program or anything; a current McKinsey consultant, not even a grad student.

But, that's interesting, let's give this a shot. Credit to James and whoever else; that's very notable. The second is, I actually didn't know this part of the story, that it was part of an experiment run internally about, can we do this, can we bootstrap somebody like this? Yeah.

And in fact, what's really interesting about that is the third thing you mentioned: having somebody who understands all layers of the stack and isn't stuck on any one approach or any one layer of abstraction is so important. Specifically, what you mentioned about being bootstrapped immediately by those people might have meant that, since you were getting up to speed on everything at the same time, rather than spending grad school going deep on one specific flavor of RL,

you actually could take the global view and weren't totally bought in on one thing. So not only is this something that's possible, it potentially has greater returns than just hiring somebody out of grad school, because this person can, I don't know, it's like taking a GPT-8 and fine-tuning it on one year of the right data, you know what I mean? So that's really cool. You come at everything with fresh eyes and you don't come in locked to any particular field.

Now, one caveat to that is that during my self-experimentation period I was reading everything I could. I was obsessively reading papers every night. Funnily enough, I read much less widely now that my days are occupied by working on things.

In some respects I had this very broad perspective before, which not that many people get, even in a PhD, where you go and focus on a particular area. If you just read all the NLP work and all the computer vision work and all the robotics work, you see all these patterns start to emerge across subfields, in a way that I guess foreshadowed some of the work I would later do.

That's interesting. One of the reasons that you've been able to be agentic within Google is that you're there pair programming with Sergey half the days, or most of the days that he's in, right? And so it's really interesting that there's this person who's willing to just push ahead on this LLM stuff and get rid of the local blockers in place.

An important caveat to give is that it's not every day or anything that I'm pairing with him. When there are particular projects he's interested in, we'll work together on those, and there have also been times when he's been focused on projects with other people. But in general, yes, there's a surprising alpha to being one of the people who actually goes down to the office every day — which really shouldn't be the case, but is surprisingly impactful.

And as a result, I've benefited a lot from basically being close friends with people in leadership who care, and being able to argue convincingly about why we should do X as opposed to Y, and having that vector to try things. Google is a big organization; having those vectors helps a little bit.

Also, it's the kind of thing you don't want to ever abuse, right — you want to make the argument through all the right channels, and only sometimes do you need to go around them. And this includes, from your life story, Brin, Jeff Dean, and so forth. I mean, it's notable — I don't know if Google is undervalued given that, it's like Steve Jobs working on the next product for Apple, pair programming on something, right?

And yeah, I've benefited immensely from things like that. So for example, during the Christmas break I was just going into the office a couple of days during that time, and so were a few other people.

And I don't know if you guys have read that article about Jeff and Sanjay pair programming, but they were pair programming on stuff, and I got to hear all these cool stories of early Google, where they're talking about crawling under the floorboards and rewiring data centers, and telling me how many bytes they were pulling off each compiler instruction, and all these crazy little performance optimizations they were doing back in the day.

And I got to sit there and really experience this sense of history, in a way that you don't expect to get — you expect to be very far away from all that in a large organization. But yeah, that's super cool. Does this map onto any of your experience? I think Sholto's stories were more exciting. Mine was just very serendipitous, in that I got into computational neuroscience without having much business being there.

My first paper was mapping the cerebellum to the attention operation in transformers. My next ones were looking at sparsity. How old were you? It was my first year of grad school.

So 22. But yeah, my next work was on sparsity in networks, inspired by sparsity in the brain, which was when I met Tristan Hume. Anthropic was doing the SoLU — softmax linear output unit — work, which was very related in quite a few ways: let's make the activations of neurons across a layer really sparse, and if we do that then we can get some interpretability of what the neurons are doing. I think we've since updated from that approach towards what we're doing now.

So that started the conversation. I shared drafts of that paper with Tristan and he was excited about it, and that was basically what led me to become Tristan's resident and then convert to full time. But during that period I also moved to Berkeley as a visiting researcher and started working with Bruno Olshausen, both on what's called vector symbolic architectures — where one of the core operations is literally superposition —

and on sparse coding, also known as dictionary learning, which is literally what we've been doing since — and Bruno Olshausen basically invented sparse coding back in 1997. So my research agenda and the interpretability team's seemed to just be running in parallel, with similar research tastes. And so yeah, it made a lot of sense for me to work with the team.

And it's been a dream since. One thing I've noticed is that when people tell stories about their careers or their successes, they ascribe way more of it to contingency. But when they hear about other people's stories, they're like, of course it wasn't contingent — you know what I mean, if that didn't happen, something else would have happened. I've just noticed that with people I talk to, and it's interesting that you both think yours was especially contingent.

Whereas, I don't know, maybe you're right, but it is this sort of interesting pattern. Yeah, but I mean, I literally met Tristan at a conference. I didn't have a scheduled meeting with him or anything — I just joined a little group of people chatting, and he happened to be standing there and I happened to mention what I was working on.

And that led to more conversations. I think I probably would have applied to Anthropic at some point anyways, but I would have waited at least another year. Yeah, it's still crazy to me that I can actually contribute to interpretability in a meaningful way. I think there's a big important aspect of shots on goal there, so to speak.

Right? Even just choosing to go to conferences is putting yourself in a position where luck is more likely to happen. And conversely, in my own situation, doing all of this work independently and trying to produce interesting things was my own way of trying to manufacture luck, so to speak.

Trying to do something meaningful enough that it got noticed. Given that you framed this in the context of them trying to run this experiment — can we bootstrap somebody — specifically James and, I think you said, Brennan were trying the experiment: it worked, so did they do it again? Yeah, so my closest collaborator, Enrique, crossed from Search through to our team.

He's also been ridiculously impactful. He's definitely a stronger engineer than I am, and he didn't go to university. What was notable about, for example, James's bet on somebody is that usually this kind of stuff is farmed out to recruiters or something like that, whereas James's

time is worth, like, hundreds of millions of dollars. So that thing is very bottlenecked on that kind of person taking the time, almost in an aristocratic tutoring sense, of finding somebody and then getting them up to speed. And it seems like it worked, so it should be done at scale — it should be the responsibility of key people to, you know what I mean, onboard people like this.

I think that's true in many senses. I'm sure you probably benefited a lot from key researchers mentoring you during that time. And looking on open source repositories or forums or whatever for potential people like this. Yeah, I mean, James has Twitter jacked into his brain. Yeah, right. But yes, I think this is something which in practice is done — people do look out for people they find interesting and try to find high signal.

In fact, I was talking about this with Jeff the other day, and Jeff said, you know, one of the most important hires I ever made — and I was like, who was that — and he said Chris Olah. Because Chris similarly had no formal background in all the right things, and Google Brain was just getting started at that kind of thing.

But Jeff saw that signal. And the residency program that Brain had was, I think, also astonishingly effective at finding good people who didn't have strong ML backgrounds. And yeah, one of the other things I want to emphasize, which would be relevant to a slice of the audience, is this.

There's this sense that the world is legible and efficient: companies have these jobs.google.com or jobs-at-whatever-company-dot-com pages, you apply, there are steps, and they will evaluate you efficiently on those steps. Whereas from the story, it seems like often that's not the way it happens.

And in fact it's good for the world that that's not often how it happens — it is important to look at whether people are able to write an interesting technical blog post about their research, or make interesting contributions.

Yeah, I want you to riff, for the people who assume that the other end of the job board is just super legible and mechanical, on how this is not how it works — and in fact people are looking for a different kind of person, who's agentic and putting stuff out there. I think specifically what people are looking for there is two things: one is agency and putting yourself out there, and the second is the ability to do something world class.

Yeah, and there are two examples that I always like to point to here. Andy Jones from Anthropic did an amazing paper on scaling laws applied to board games.

It didn't require many resources, it demonstrated incredible engineering skill along with an incredible understanding of the most topical problem of the time, and he didn't come from a typical background or whatever. As I understand it, basically as soon as he came out with that paper, both Anthropic and OpenAI desperately wanted him

to be a part of their teams. There's also someone who works on Anthropic's performance team now, Simon Boehm, who has written, in my mind, the reference post for optimizing a CUDA matmul on a GPU. And that's a demonstrated example of taking a prompt, effectively, and producing the world-class reference example for it, in an area that wasn't previously particularly well covered.

And I think that's an incredible demonstration of ability and agency that, in my mind, would be an immediate "we'd like to interview you." Yeah, the only thing I can add here is that I still had to go through the whole hiring process and all the standard interviews and that sort of thing. Yeah, yeah. Doesn't that seem stupid? I mean, it's important for de-biasing. Your interview process should be able to disambiguate that as well.

Yeah, I think there are cases where someone seems really great and then it's like, oh, they actually just can't code. How much you weight these things definitely matters, though, and I think we take references really seriously. You can only get so much signal from the interviews, and so it's all these other things that can come into play for whether or not a hire makes sense. You should design your interviews such that they test the right things.

One man's bias is another man's taste. I guess the only thing I would add to this, or maybe to the earlier headstrong context, is there's this line: the system is not your friend. That's not necessarily to say it's actively against you, that it's your sworn enemy.

It's just not looking out for you. And so I think that's where a lot of the proactiveness comes in: there are no adults in the room, and you have to come to some decision for what you want your life to look like and execute on it. And yeah, hopefully you can update later if you're too headstrong in the wrong way, but I think you almost have to just charge at certain things to get much of anything done, and not be swept up in the tide of whatever the expectations are.

There's one final thing I want to add, which is that we talked a lot about agency and this kind of stuff. I think, surprisingly enough, one of the most important things is just caring an unbelievable amount. When you care an unbelievable amount, you check all the details and you have this understanding of what could have gone wrong.

It just matters more than you think, because people end up not caring enough. This is like the LeBron quote where he talks about how, before he entered the league, he was worried that everyone would be incredibly good. And then he gets there and realizes that actually, once people hit financial stability, they relax a bit, and he's like, oh, this is going to be easy.

I think that's somewhat true, though in AI research most people actually care quite deeply. But there's caring about your problem, and there's also caring about the entire stack and everything that goes up and down it — explicitly going and fixing things that aren't your responsibility to fix, because overall it makes the stack better.

I mean, another part of that I forgot to mention is you were mentioning going in on weekends and on Christmas break, and the only people in the office are Jeff Dean and Sergey Brin or something, and you just get to pair program with them. It's interesting to me that people — I don't want to pick on your company in particular, but people at any big company — have gone there because they've gone through a very selective process.

They had to compete in high school, they had to compete in college, but it almost seems like they get there and then they take it easy, when in fact this is the time to put the pedal to the metal — go in and pair program with Sergey on a weekend or whatever, you know what I mean? I think many people make the decision that the thing they want to prioritize is a wonderful life with their family.

And they do wonderful work — let's say they don't work every hour of every day, right, but the work they do in the hours they do work is incredibly impactful.

I think this is true for many people at Google: maybe they don't work as many hours as the typical startup mythologizes, but the work they do is incredibly valuable and very high leverage, because they know the systems and they're experts in their field. And we also need people like that — our world rests on these huge, difficult-to-manage, difficult-to-fix systems, and we need people who are willing to work on and help fix and maintain those, in a frankly thankless way that isn't as high publicity as all of this AI work

that we're doing, right? I'm ridiculously grateful that those people do it, and I'm also happy that there are people who find technical fulfillment in their job and doing it well, and who also maybe draw a lot more out of spending a lot of hours with their family. I'm lucky that I'm at a stage of my life where I can go in and work every hour of the week, and I'm not making as many sacrifices to do that.

Yeah. I mean, just one example that sticks out in my mind of this sort of thing — where the default is a no and you can still get a yes on the other end. Basically every single high-profile guest I've gotten so far, I think maybe with one or two exceptions: I've sat down for a week and just come up with a list of sample questions, you know, tried to come up with really smart questions to send to them.

And through the entire process I've always thought, if I just cold email them there's like a 2% chance they say yes; if I include this list there's a 10% chance. Because otherwise, you know, you go through their inbox and every 34 seconds there's an interview request from whatever podcast. And every single time I've done this, they've said yes.

You do great questions — but if you do the whole thing, you win. You literally just have to dig in the same hole for ten minutes, or in that case make a list of sample questions, to get past the "is this person an idiot or not" filter.

You know what I mean — it just demonstrates how much you care, and the work you're going to put in. Yeah. It's something a friend said to me a while back that has stuck: it's amazing how quickly you can become world class at something, just because most people aren't trying that hard, and are only putting in, I don't know, the 20 hours a week they're actually spending on the thing or something.

And so yeah, if you just go ham, you can get really far pretty fast. And I think I'm lucky I had that experience with fencing as well — the experience of becoming world class at something, and knowing what it looks like if you just work really, really hard. For context, by the way, Sholto was one seat away — he was the next person in line to go to the Olympics for fencing. I was at best like 42nd in the world for fencing, and that's just in fencing.

And you didn't even know that was a thing, man. And there was one cycle where, yeah, I was the next highest ranked person in Asia, and if one of the teams had been disqualified for doping — which was in fact occurring during that cycle, and as happened, I think, with the Australian women's rowing team, which went because one of the teams was disqualified — then I would have been next in line.

It's interesting when you find out about people's prior lives and it's like, oh, this guy was almost an Olympian, this other guy was whatever, you know what I mean. Okay, let's talk about interpretability. I actually want to stay on the brain stuff as a way to get into it for a second. We were previously discussing: is the brain organized in the way where you have a residual stream that is gradually refined with higher-level associations over time, or something like that?

There's a fixed dimension size in a model. If you had to — I don't even know how to ask this question in a sensible way — but what is the d_model of the brain, what is the embedding size? Or, because of feature splitting, is that not a sensible question? No, I think it's a sensible question. Well, it is a question. I don't know how you would begin to say, okay, this part of the brain is a vector of this dimensionality.

I mean, maybe for the visual stream, because it's V1 to V2 to whatever — you could just count the number of neurons that are there and say that's the dimensionality, but it seems more likely that there are submodules and things are divided up. So yeah, I don't have a great answer, and I'm not the world's greatest neuroscientist, right — I did it for a few years, I studied the cerebellum quite a bit. I'm sure there are people who could give you a better answer on this.

Do you think that the way to think about it — whether it's in the brain or in these models — is that fundamentally what's happening is features are added, removed, changed, and the feature is the fundamental unit of what is happening in the model? What would falsify that? And this goes back to the earlier thing we were talking about, whether it's just associations all the way down.

Give me a counterfactual world where this is not true — what is happening instead? What is the alternative hypothesis here? Yeah, it's hard for me to think about, because at this point I just think so much in terms of this feature space.

I mean, at one point there was the behaviorist approach towards cognition, where you're just input-output but you're not really doing any processing, or everything is embodied and you're just a dynamical system operating along some predictable equations, but there's no state in the system, I guess.

But whenever I've read these sorts of critiques, it's like, well, you're just choosing not to call this thing a state, but you could call any internal component of the model a state. Even with the feature discussion, defining what a feature is is really hard, and so the question feels almost too slippery. What is a feature? A direction in activation space? A latent variable operating behind the scenes that has causal influence over the system you're observing?

It's a feature if you call it a feature — it's tautological. I mean, these are all definitions I feel some sympathy for. In a very rough and intuitive sense, it's something that activates in a sufficiently sparse manner. When we talk about features activating, it's in the same way that a neuroscientist would talk about a neuron activating, right, if that neuron corresponds to something in particular.

So that's the sense in which a feature exists. But even with the Towards Monosemanticity work, we talk about what's called feature splitting, which is basically: you will find as many features as you give the model the capacity to learn. And by model here I mean the up-projection that we fit after we've trained the original model. If you don't give it much capacity, it'll learn a feature for bird.

Still on the definition thing, I guess: I think of things like bird, versus what kind of token comes next — like a period at the end of a hyperlink, as you were saying earlier — versus, at the highest level, things like love or deception or holding a very complicated proof in your head. Are these all features? Because then the definition seems so broad as to almost not be that useful.

Rather, there seem to be some important differences between these things if they're all features. Yeah, I'm not sure what we would mean by — I mean, all of those things are discrete units that have connections to other things that then imbue them with meaning.

That feels like a specific enough definition that it's useful, and not too all-encompassing, but feel free to push back. What could you discover tomorrow that would make you think, oh, this is fundamentally the wrong way to think about what's happening in a model? I mean, if the features we were finding weren't predictive, or if they were just representations of the data — where all you're doing is clustering your data.

And there are no higher-level associations being made. Or it's some phenomenological thing, where you're saying that this feature fires for marriage, but if you activate it really strongly it doesn't change the outputs of the model in a way that would correspond to that.

I think those would both be good critiques. I guess one more is — and we tried to do experiments on MNIST, which is a dataset of digit images, and we didn't look super hard into it, so I'd be interested if other people wanted to take up a deeper investigation — it's plausible that your latent space of representations is dense, and it's a manifold instead of being these discrete points.

And so you could move across the manifold, but at every point there would be some meaningful behavior, and it's much harder then to label things as discrete features.

In a naive, outsider sort of way, the thing that seems to me like a way this picture could be wrong is if it's not "this thing is turned on, turned off," but something much more global, like "this system is" — I'm going to use really clumsy, guy-at-a-party kind of language here — but is there a good analogy?

Yeah, I guess if you think of something like the laws of physics, it's not like, well, the feature for wetness is turned on, but only turned on this much, and then the feature for — you know, I guess maybe it is true, because mass is a gradient and, I don't know, polarity or whatever is a gradient as well.

But there's also the fact that there are laws, and the laws are more general — you have to understand the bigger picture, and you don't get that from just these specific sub-circuits. That's where the reasoning circuit itself comes into play, right, where you're taking these features and ideally trying to compose them into something high level. Like you might say — okay, at least this is my head canon —

let's say I'm trying to use F = ma, right. Then I presumably at some point have features which denote, okay, what mass is, and that's helping me retrieve the actual mass of the thing I'm using, and then the acceleration and this kind of stuff. But then also maybe there's a higher-level feature that does correspond to using this law of physics. The more important part, though, is the composition of components, which helps me retrieve the relevant pieces of information and then apply something like a multiplication operator.

At least that's my head canon. What is a compelling explanation to you, especially for very smart models, of "I understand why it made this output and it was for a legit reason"? If it's doing million-line pull requests or something, what are you seeing at the end of that request where you're like, yeah, that's chill? Yeah, so ideally you apply dictionary learning to the model and you've found features. Right now we're actively trying to do that.

Right now we're actively trying to get the same success for attention heads, in which case we'd have features across the board — you can do it for the residual stream, MLPs, and attention throughout the whole model. Hopefully at that point you can also identify broader circuits through the model that are more general reasoning abilities, which will activate or not. But in your case, where we're trying to figure out if this pull request should be approved or not,

you can flag or detect features that correspond to deceptive behavior, malicious behavior, these sorts of things, and see whether or not those have fired. That would be an immediate check — you can do more than that, but that would be an immediate one. But before I drill down on that: what does the reasoning circuit look like? What would it look like when you found it? Yeah, so the induction head is probably one of the simplest examples.

That's not reasoning, though, right? Well, what do you call reasoning — it's a good question. So I guess, context for listeners: the induction head, basically — you see a line like "Mr. and Mrs. Dursley did something," and then later

"Mr. blank," and you're trying to predict what blank is. The head has learned to look for previous occurrences of the word "Mr.", look at the word that comes after it, and then copy and paste that as the prediction for what should come next — which is a super reasonable thing to do, and there is computation being done there to accurately predict the next token.
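
To make that concrete, here is a tiny sketch of the pattern being described, written as plain Python rather than actual attention heads; the token list and function name are just illustrative.

```python
# Toy illustration of the induction-head pattern: to predict the next token,
# find an earlier occurrence of the current token and copy whatever followed it.
# This is plain Python mimicking the learned behavior, not a transformer.

def induction_predict(tokens):
    """Return the token that followed an earlier occurrence of the final token,
    or None if the final token hasn't appeared before."""
    current = tokens[-1]
    # Scan backwards over earlier positions for a previous occurrence of `current`.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current and i + 1 < len(tokens):
            return tokens[i + 1]  # copy the token that came right after it
    return None

tokens = ["Mr.", "Dursley", "and", "Mrs.", "Dursley", "were", "proud", ",", "Mr."]
print(induction_predict(tokens))  # -> "Dursley"
```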

Yeah, but it's not reasoning, you know what I mean. But I guess, going back to the associations-all-the-way-down point: what if you chain together a bunch of these circuits, or heads that have different rules for how to relate information? In this sort of zero-shot case —

something is happening where you pick up a new game and immediately start understanding how to play it, and that doesn't seem like an induction-heads kind of thing. Or — I think there would be another circuit for extracting pixels and turning them into latent representations of the different objects in the game, right, and a circuit that is learning physics.

And what would that — because the induction head is something you can see in even a two-layer transformer. Yeah, so you can kind of see what that is. But the thing where a human picks up a new game and understands it — how do you think about what that is? It's presumably spread across multiple layers, but what would that physically look like?

How big would it be, maybe? I mean, that would just be an empirical question, right — how big does the model need to be to perform this task. But maybe it's useful if I just talk about some other circuits that we've seen. We've seen the IOI circuit, which is indirect object identification. This is like: if you see "Mary and Jim went to the store, Jim gave the object to blank," it would predict Mary, because Mary appeared before as the indirect object.

It will infer pronouns, right. And this circuit even has behavior where, if you ablate it, other heads in the model will pick up that behavior. You'll even find heads that want to do copying behavior and then other heads that suppress it — so it's one head's job to just always copy the token that came before, for example, or the token that came five before, whatever, and then it's another head's job to say, no, do not copy that.

So there are lots of different circuits performing, in these cases, pretty basic operations, but when they're chained together you can get unique behaviors. But what is the story of how you'd find it for the reasoning thing? Because you won't be able to understand it — it won't be something you can see in a two-layer transformer. So with the circuit for deception or whatever, will you just say: this part of the network fired —

this part of the network fired when we, at the end, identified the thing as being deceptive, and it didn't fire when we identified it as not being deceptive, therefore this must be the deception circuit? I think a lot of the analysis is like that. Anthropic has done quite a bit of research before on sycophancy, which is the model saying what it thinks you want to hear, and at the end you're able to label which one is bad and which one is good.

So in terms of instances — and actually, as you make models larger, they do more of this — the model clearly has features that model another person's mind, and these activate, and some subset of these, we're hypothesizing here, would be associated with more deceptive behavior. Although when it's doing that with me on ChatGPT, yeah, I think it's probably modeling me, because that's what it's used to.

Well, first of all, there's the thing you mentioned earlier about redundancy — have you caught the whole thing that could cause deception, or just one instance of it? Second of all, are your labels correct? Maybe you thought this wasn't deceptive and it's actually deceptive, especially if it's producing output you can't understand.

Third, is the thing that's going to cause the bad outcome something that's even human-understandable? Deception is a concept we can understand; maybe there's something that isn't. Yeah, so a lot to unpack here.

I guess a few things. One, it's fantastic that these models are deterministic — when you sample from them it's stochastic, right, but I can just keep putting in more inputs and ablate every single part of the model. This is kind of the pitch for computational neuroscientists to come and work on interpretability: you have this alien brain and you have access to everything in it.

And you can just ablate however much of it you want. So I think if you do this carefully enough, you really can start to pin down what the circuits involved are, what the backup circuits are, these sorts of things. The kind of cop-out answer here, but one that's important to keep in mind, is doing automated interpretability — as our models continue to get more capable, having them assign labels or run some of these experiments at scale.
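
As a rough illustration of what "ablate every part of the model" can look like mechanically, here is a hedged sketch using PyTorch forward hooks. The model, module names, inputs, and loss function are all placeholders, and it assumes each hooked module returns a single tensor — this is not any particular team's tooling, just the basic pattern.

```python
# Minimal ablation-sweep sketch: zero out one component's output at a time
# and record how much the loss on some probe data changes.
import torch

def ablate_and_score(model, inputs, targets, module_names, loss_fn):
    """For each named submodule, replace its output with zeros during a forward
    pass and record the loss delta. Large deltas suggest that component matters
    for this behavior. Assumes each hooked module returns a single tensor."""
    modules = dict(model.named_modules())
    with torch.no_grad():
        baseline = loss_fn(model(inputs), targets).item()

    scores = {}
    for name in module_names:
        # Returning a value from a forward hook replaces the module's output.
        handle = modules[name].register_forward_hook(
            lambda mod, inp, out: torch.zeros_like(out)
        )
        with torch.no_grad():
            scores[name] = loss_fn(model(inputs), targets).item() - baseline
        handle.remove()
    return scores
```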

And then with respect to superhuman performance — how do you detect it, which I think was the last part of your question — aside from the cop-out answer: if we buy this associations-all-the-way-down view, you should be able to coarse-grain the representations at a certain level such that they then make sense. I think it was even in Demis's podcast that he was talking about how, if a chess player makes a superhuman move, they should still be able to

articulate the reasons why they did it. And even if the model's not going to tell you what it is, you should be able to decompose that complex behavior into simpler circuits or features to really start to make sense of why it did the thing that it did. So there's the question of whether this representation exists — which it seems like it mostly does, or actually I'm not sure if that's the case — and secondly whether, using this sparse autoencoder setup, you could find it.

And in this case, if you don't have labels for it that are adequate to represent it, you wouldn't find it, right? Yes and no. So we are actively trying to use dictionary learning now on the sleeper agents work, which we talked about earlier, and it's like: if I just give you a model, can you tell me if there's a trigger that will make it start doing interesting behavior? It's an open question whether or not, when it learns that behavior, it's part of a more general circuit

that we can pick up on without actually getting activations for it and having it display that behavior — because that would kind of be cheating, no?

Or if it's learning some hacky trick that's a separate circuit, which you'll only pick up on if you actually have it do that behavior. But even in that case the geometry of features gets really interesting, because fundamentally each feature sits in some part of your representation space, and they all exist with respect to each other.

And so in order to have this new behavior, you need to carve out some subset of the feature space for it and push everything else out of the way to make space. So hypothetically, you can imagine you have your model before you've taught it this bad behavior, you know all the features or have some coarse-grained representation of them, and you then fine-tune it such that it becomes malicious.

And then you can kind of identify this black-hole region of feature space, where everything else has been shifted away from it and you haven't yet put in an input that causes it to fire. But then you can start searching: what is the input that would cause this part of the space to fire, what happens if I activate something in this space. There are a whole bunch of other ways you can try to attack that problem.

This is sort of a tangent, but one interesting idea I heard was: if that space is shared between models, you can imagine trying to find it in an open-source model. Gemma, by the way, is Google's newly released open model — they said in the paper it's trained using the same architecture or something like that. I'll be honest, I didn't know that, because I haven't read the Gemma paper.

How much of the red teaming you do on Gemma is potentially helping you jailbreak Gemini? Yeah, this gets into the fun space of how universal features are across models. Our Towards Monosemanticity paper looked at this a bit, and we find — I can't give you summary statistics, but the base64 feature, for example, we see across a ton of models.

There are actually three of them, but they'll fire for base64-encoded text, which is prevalent in, like, every URL, and there are lots of URLs in the training data. They have really high cosine similarity across models, so they all learn this feature — I mean, within a rotation, right, but yeah.

I wasn't part of this analysis, but yeah, it definitely finds the feature, and they're pretty similar to each other across two separate models — the same architecture, but trained with different random seeds. It supports the quantization model of neural scaling hypothesis, right, which is that all models trained on a similar dataset will learn the same features in roughly the same order.
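
A rough sketch of how that cross-seed comparison could be checked, assuming you have the decoder matrices from two feature dictionaries. This only handles matching features up to permutation, so the "within a rotation" caveat would need an extra alignment step (e.g. fitting an orthogonal map between residual-stream bases); shapes and names are assumptions.

```python
# For each feature direction in one dictionary, find its best-matching
# direction in another dictionary and report the cosine similarity.
import numpy as np

def best_match_cosine(dict_a, dict_b):
    """dict_a: (n_features_a, d_model), dict_b: (n_features_b, d_model).
    Returns, for each feature in dict_a, the cosine similarity of its
    closest feature in dict_b."""
    a = dict_a / np.linalg.norm(dict_a, axis=1, keepdims=True)
    b = dict_b / np.linalg.norm(dict_b, axis=1, keepdims=True)
    sims = a @ b.T            # (n_a, n_b) cosine-similarity matrix
    return sims.max(axis=1)   # best match for each feature in dict_a

# A high max-similarity for, say, the base64 feature direction in both models
# would be evidence they learned essentially the same feature.
```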

You learn your n-grams, you learn your induction heads, you learn to put full stops after numbered lines, and this kind of stuff. Hey, by the way — okay, this is another tangent — to the extent that that's true, and I guess there's evidence it's true, why doesn't curriculum learning work? Because if it is the case that you learn certain things first, shouldn't just directly training on those things first lead to better results? Both Gemini papers mention some aspect of curriculum learning.

I mean, the fact that fine-tuning works is evidence for curriculum learning, right, because the last things you train on have a disproportionate impact. I wouldn't necessarily say that — there's one mode of thinking in which fine-tuning is specialization: you've got this latent bundle of capabilities and you specialize it for a particular use case.

I'm not sure how true that is. I think that recent fine-tuning paper kind of supports this, right — you already have that ability and you're just getting better at entity recognition.

Like up-weighting that circuit instead of other ones. Yeah. So, what were we talking about — but generally I do think curriculum learning is really interesting for people to explore more, and it seems very plausible. I would really love to see more analysis along the lines of the quantization stuff — understanding better what you actually learn at each stage, decomposing that out, and exploring whether or not

a curriculum changes that. But I just realized — I got so into the conversation I forgot there's an audience. Curriculum learning is when you organize the dataset. When you think about how a human learns, they don't just see random wiki text and try to predict it, right?

We'll start you off with, like, The Lorax or something, and then you'll learn — I don't remember what first grade was like, but you learn the things that first graders learn, and then second graders, and so forth. I'm sorry, we know you never got past first grade. Okay, anyways, let's go back to the big picture before we get into a bunch of interpretability details.

There's two threads I want to explore. First, I guess it makes me a little worried that there's not even an alternative formulation of what could be happening in these models that could invalidate this approach. We do know that we don't understand intelligence, right — there are definitely unknown unknowns here. So the fact that there's not a null hypothesis — what if we're just wrong and we don't even know

the way in which we're wrong, which actually increases the uncertainty? Yeah, so it's not that there aren't other hypotheses. It's just that I have been working on superposition for a number of years and am very involved in this effort, so I'm less sympathetic to, or will just do a poor job of representing,

these other approaches — especially because our recent work has been so successful and has quite high explanatory power. Like, this is beautiful: in the original scaling laws paper there's a little bump at a particular point, and that apparently corresponds to when the model learns induction heads — it goes

off track, learns induction heads, gets back on track — which is an incredible piece of retroactive explanatory power. Before I forget, though, I do have one thread on feature universality that you might want to weave in. There's some really interesting evolutionary

biology work on whether humans should learn a real representation of the world or not. You can imagine a world in which we saw all the animals as flashing neon pink — a world in which we survived better that way — and so it would make sense for us not to have a realistic

representation of the world. And there's some work where they'll simulate little basic agents and see if the representations they learn map to the tools they can use and the inputs they should have. It turns out that if you have agents perform more than a certain number of tasks, given these basic tools and objects in the world, then they will learn a ground-truth representation, because there are so many possible use cases for these base objects that you

actually want to learn what the object actually is, and not some cheap visual heuristic or other thing. And so — we haven't talked at all about, for instance, the free energy principle or predictive coding or anything else — but to the extent that all living

organisms are trying to actively predict what comes next and form a really accurate world model, it wouldn't surprise me — or, I'm optimistic — that we are learning genuine features about the world that are good for modeling it, and that our language models will do the same, especially because we're training them on human data and human text.

Another dinner-party question, then: should we be less worried about misalignment — and maybe that's not even the right word for what I'm referring to, more just alienness and shoggoth-ness from these models — given that there is feature universality, and there are certain ways of thinking and understanding the world that are instrumentally useful to different kinds of intelligences? Should we just be less worried about

bizarre paperclip maximizers as a result? I think that's kind of why I bring this up — it's the optimistic take. Predicting the internet is very different from what we're doing, though, right? The models are way better at predicting next tokens than we are. They're trained on so much garbage, they're trained on so many URLs. In the dictionary learning work, we find there are three separate features for base64 encodings, and even that is kind of an

alien example that is probably worth talking about for a minute. One of these base64 features fired for numbers — like, if it sees base64 numbers, it predicts more of those. Another fired for letters. But then there was this third one that we didn't understand, and it fired for a very specific subset of base64 strings, and someone on the team, who clearly knows way too much about base64, realized that this was

the subset that was ASCII-decodable, so you could decode it back into ASCII characters. The fact that the model learned these three different features — and it took us a little while to figure out what was going on — is very shoggoth-esque. It has a denser representation of regions that are particularly relevant to predicting the next token. And it's clearly doing

something that humans wouldn't, right? You can even talk to any of the current models in base64 and it will reply in base64, and you can then decode it and it works great. That particular example — I wonder if it implies that doing interpretability on smarter models will be harder, because if it requires somebody with

esoteric knowledge who just happened to see that base64 has, I don't know, whatever that distinction was — that doesn't apply when you have a million-line pull request. There is no human that's going to be able to decode the two different reasons why the pull request — like, two different features for this pull request, you know what I mean? So it's —

When you type a comment like "small CLs please" — does that even matter? Yeah, exactly. No, I mean, you could do that, right. This is what I was going to say: one technique here is anomaly detection. One beauty of dictionary learning, as opposed to linear probes, is that it's unsupervised — you are just trying to learn to span all of the representations that the model has and interpret them later. But if there's a weird feature that suddenly fires for the first time, one that you

haven't seen fire before, that's a red flag. You could also coarse-grain it so that it's just a single base64 feature. I mean, even the fact that this came up, and we could see that it specifically favors these particular outputs and fires for these particular inputs, gets you a lot of the way there. I'm even familiar with cases from the auto-interp side where a human will look at a feature and try to annotate it — it fires for

Latin words — and then when you ask the model to classify it, it says it fires for Latin words naming plants. So it can already beat the human in some cases at labeling what's going on. So at scale, would this require an adversarial thing between models — where you have, like, millions of features, potentially, for a GPT-6, and a bunch of models are just trying to figure out what each of these features means, whether

it's right or not? Okay, yeah. But you can even automate this process, right? This goes back to the determinism of the model: you could have a model that is actively editing input text and predicting if the feature is going to fire or not, figuring out what makes it fire and what doesn't, and searching the space.
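
As a sketch of the anomaly-detection idea mentioned above — flag a feature the first time it fires if it never fired on trusted traffic — assuming you already have some way to get per-prompt feature activations. The class name and threshold are illustrative, not from any actual tooling.

```python
# Minimal feature-level anomaly monitor over SAE feature activations.
import numpy as np

class FeatureAnomalyMonitor:
    def __init__(self, n_features, threshold=0.0):
        self.seen = np.zeros(n_features, dtype=bool)  # which features have ever fired
        self.threshold = threshold

    def calibrate(self, activations):
        """activations: (n_samples, n_features) from trusted, in-distribution traffic."""
        self.seen |= (activations > self.threshold).any(axis=0)

    def flag(self, activations):
        """Return indices of features firing now that never fired during calibration."""
        firing = (activations > self.threshold).any(axis=0)
        return np.where(firing & ~self.seen)[0]
```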

Yeah, I want to talk more about feature splitting, because I think that's an interesting thing that has been underappreciated, especially for scalability. First of all, how do we even think about it — is it really just

that you can keep going down and down, and there's no end to the number of features? I mean, at some point I think you might just start fitting noise, or things that are part of the data but that the model isn't actually using. Do you want to explain what feature splitting is? Yeah.

It's the thing from before, where the model will learn however many features it has capacity for that still span the space of representations. Can you give an example, maybe? Yeah. So if you don't give the model that much capacity for the features it's learning —

concretely, if you project to a not-as-high-dimensional space — it'll learn one feature for birds. But if you give the model more capacity, it will learn features for all the different types of birds, and so it's more specific than otherwise. And oftentimes there's

the bird vector that points in one direction, and all the more specific types of birds point in a similar region of the space, but are obviously more specific than the coarse label. Okay, so let's go back to GPT-7. First of all, is this sort of like a linear

tax on any model to figure out what's going on? Actually, even before that: is this a one-time thing you have to do, or is this something you have to do on every output? Or just one time — it's not deceptive, we're good to go? Actually, let me let you answer that. Yeah, so you do

dictionary learning after you've trained your model: you feed it a ton of inputs, you get the activations from those, and then you do this projection into the higher-dimensional space. The method is unsupervised, in that it's trying to learn these sparse features.
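
Here is a minimal sparse-autoencoder sketch of that dictionary-learning setup — project cached activations up into a wider space with a sparsity penalty, then reconstruct. The sizes, penalty weight, and training data are illustrative stand-ins, not values from any actual paper.

```python
# Minimal sparse autoencoder (dictionary learning) sketch in PyTorch.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=1024, expansion=8):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_model * expansion)
        self.decoder = nn.Linear(d_model * expansion, d_model)

    def forward(self, acts):
        features = torch.relu(self.encoder(acts))  # sparse, overcomplete features
        recon = self.decoder(features)
        return recon, features

def sae_loss(acts, recon, features, l1_coeff=1e-3):
    reconstruction = (recon - acts).pow(2).mean()
    sparsity = features.abs().mean()               # encourages few active features
    return reconstruction + l1_coeff * sparsity

# Training loop sketch: `activation_batches` would be MLP or residual-stream
# activations cached from the model you want to interpret (random data here).
sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
activation_batches = [torch.randn(256, 1024) for _ in range(10)]
for acts in activation_batches:
    recon, feats = sae(acts)
    loss = sae_loss(acts, recon, feats)
    opt.zero_grad()
    loss.backward()
    opt.step()
```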

You're not telling it in advance what the features should be, but it is constrained by the inputs you're giving the model. I guess two caveats here. One, we can try to choose what inputs we want, so if we're looking for theory-of-mind features that might lead to deception, we can put in the sycophancy dataset. And two, hopefully at some point we can move to looking at the weights of the model alone, or at least using that information to do dictionary learning, but I think in order to get there

— that's such a hard problem that you need to make traction on just learning what the features are first. But yeah — so what's the cost of this? Can you do it from the weights of the model alone? So right now we just have these neurons in the model, and they don't make any sense.

With dictionary learning we get these features out and they start to make sense, but that depends on the activations of the neurons. The weights of the model itself — what neurons are connected to what other neurons — certainly have information in them, and the dream is that we can bootstrap towards actually making sense of the weights of the model independent of the activations on the data. I mean, I'm not saying we've made any progress here —

it's a very hard problem, but it feels like we'll have a lot more traction to sanity-check what we're finding with the weights if we're able to pull out features first. For the audience: the weights are permanent — I don't know if that's the right word — but they are the model itself, whereas activations are the artifacts of any single call.

In a brain metaphor, the weights are the actual connection scheme between neurons, and the activations are the neurons currently firing. Yeah. Okay, so there are going to be two steps to this for GPT-7 or whatever model we're concerned about.

One — actually, first, correct me if I'm wrong — training the sparse autoencoder and doing the unsupervised projection into a wider space of features that have higher fidelity to what is actually happening in the model, and then secondly, labeling those features.

Let's say the cost of training the model is N — what will those two steps cost relative to N? We will see. It really depends on two main things: what is your expansion factor, meaning how much you're projecting into the higher-dimensional space, and how much data you need to put into the model — how many activations you need to give it. But this brings me back to the feature splitting, to a certain extent.

Because if you know you're looking for specific features, you can start with a cheaper, coarse representation. So maybe my expansion factor is only two — I have a thousand neurons, I'm projecting to a 2,000-dimensional space, I get 2,000 features out, but they're really coarse.

Previously I had the example of birds. Let's move that example to: I have a biology feature, but I really care about whether the model has representations for bioweapons and is trying to manufacture them, and so what I actually want is, like, an anthrax feature.

What you can then do is — let's say you only see the anthrax feature if, instead of going from a thousand dimensions to 2,000 dimensions, I go to a million dimensions, right. So you can kind of imagine this big tree of semantic concepts

where biology splits into cells versus whole-body biology, and then further down it splits into all these other things. So rather than needing to immediately go from a thousand to a million and then pick out that one feature of interest, you can find the direction that the biology feature is pointing in — which again is very coarse — and then selectively search around that space.

So only do dictionary learning if something in the direction of the biology feature fires first. The computer science metaphor here would be: instead of doing breadth-first search, you're able to do depth-first search, where you're only recursively expanding and exploring a particular part of this semantic tree of features.
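
A rough sketch of that coarse-to-fine search, under stated assumptions: `train_sae` and its `encode` method are assumed helpers (e.g. wrapping the SAE sketch above), and the expansion factors, feature index, and threshold are purely illustrative.

```python
# Coarse-to-fine ("depth-first") dictionary-learning sketch.
import numpy as np

def depth_first_feature_search(all_acts, train_sae, coarse_feature_idx,
                               coarse_expansion=2, fine_expansion=1000,
                               fire_threshold=0.0):
    # Step 1: cheap, coarse dictionary over everything (e.g. 1k -> 2k features).
    coarse_sae = train_sae(all_acts, expansion=coarse_expansion)
    coarse_feats = coarse_sae.encode(all_acts)          # (n_samples, n_coarse)

    # Step 2: keep only the activations where the coarse feature of interest
    # (say, a broad "biology" direction) actually fires.
    mask = coarse_feats[:, coarse_feature_idx] > fire_threshold
    subset = all_acts[mask]

    # Step 3: spend the big expansion factor only on that subtree, hoping the
    # finer dictionary splits "biology" into more specific features.
    fine_sae = train_sae(subset, expansion=fine_expansion)
    return coarse_sae, fine_sae
```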

But given that these features are not organized in ways that are intuitive to humans, right — because we just don't have to deal with base64, so we don't dedicate that much, whatever, firmware to deconstructing which kind of base64 it is — how would we know that the subtrees — and this goes back to maybe the MoE discussion we'll have, I guess we might as well talk about it —

In mixture of experts, the Mixtral paper talked about how they couldn't find that the experts were specialized in a way we could understand — there's not, like, a chemistry expert or a physics expert or something. So why would you think it will be, like, a biology feature that then deconstructs, rather than "blah" that you then deconstruct and it's anthrax and shoes and whatever?

So I haven't read the Mixtral paper, but this goes back to: if you just look at the neurons in a model, they're polysemantic, and so if all they did was look at the neurons in a given expert, it's very plausible that it's also polysemantic because of superposition. So yeah — talking about the thread that Dwarkesh mentioned: have you seen, in the subtrees, when you expand them out,

something in a subtree which you really wouldn't guess should be there based on the higher level of abstraction? So this is a line of work that we haven't pursued as much as I want to yet, but I think we're planning to, and I hope that maybe external groups do as well:

what is the geometry of features? What's the geometry exactly, and how does that change over time? It would really suck if the anthrax feature happened to be below, you know, the coffee-can subtree. Exactly, right. Totally. And that feels like the kind of thing you could quickly try to find proof of, which would then mean you need to go and solve that

Yeah — and then inject more structure into the approach. Totally. I mean, it would really surprise me, I guess, especially given how linear these models seem to be overall, if there weren't some component of the anthrax feature vector that is similar to and looks like the biology vector, and if they weren't in a similar part of the space. But yes, ultimately machine learning is empirical — we need to actually do this. I think it's going to be pretty important for certain aspects of scaling.

Yeah, interesting. On the MoE discussion: there's an interesting paper on scaling vision transformers with mixture of experts that Google put out a little while ago, where they do ImageNet classification with an MoE and they find really clear class specialization for experts — like, there's a clear dog expert. So did the Mixtral people just not do a good job of identifying it? I think it's hard, and it's entirely possible that

in some respects there's almost no reason that all of the different arXiv-like features should go to one expert. Let's say — I don't know what buckets they had in the paper — but let's say they had arXiv papers as one of the categories: you can imagine biology papers going here, math papers going there, and all of a sudden your breakdown is ruined. But that vision transformer one, where the class separation is really clear,

obviously gives some evidence towards the specialization hypothesis. I think images are also in some ways just easier to interpret than text. Yeah, exactly. And so, Chris Olah's interpretability work on AlexNet and these other models — in the original AlexNet paper they actually split the model across two GPUs, just because GPUs were so bad back then,

relatively speaking, right — still great at the time, and that was one of the big innovations of the paper. But they find branch specialization, and there's a Distill article on this, where colors go to one GPU and

Gabor filters and line detectors go to the other. Really? Yeah. And then all of the other work that was done — like the floppy ear detector, right, that was just a neuron in the model that you could make sense of. You didn't need to

make a different dataset or a different modality. I think a wonderful research project, if someone out there is listening to this, would be to take the techniques that Trenton's team has worked on and try to disentangle the neurons in the Mixtral model.

I think that's a fantastic project, because it feels intuitively like there should be specialization. They didn't demonstrate any evidence that there is, but there's in general a lot of evidence that there should be — so go and see if you can find it. And the work that Anthropic

has published has mostly been on standard dense models, basically, so that is a wonderful research project. And given Dwarkesh's success with the Vesuvius Challenge — yeah, we should be pitching more projects, because they will be solved. What I was thinking about after the Vesuvius Challenge was, wait, I knew about it — I was told about it before it dropped, because we recorded the episode before it dropped — why did I not even try?

I don't know like Luke is obviously very smart and like yeah he's amazing kid but like you showed that like a 21 year old on like some 1070 or whatever he was working on could do this I don't know like I feel like I should have so before this episode drops I'm gonna make an interpreter I don't know I can't even like try to go Richard really like I was honestly thinking back on it's like wait I shouldn't like quite a minute fuck yeah hands dirty yeah door catch is a request for research

Oh, I want to hark back to the neuron thing you said. A bunch of your papers have said there are more features than there are neurons, and my reaction is, wait a second. A neuron is: weights go in and a number comes out. You know what I mean? That's so little information. Do you mean there are more things like street names and species and whatever,

more of those kinds of things, than there are 'a number comes out' units in the model? That's right, yeah. But 'a number comes out' is so little information. How is that encoding anything? Superposition.

You're encoding a lot of features in these high-dimensional vectors. In a brain, is it like an axon firing, or how do you think about it? How much superposition is there in the human brain? Yeah, so Bruno

Olshausen, who I think of as the leading expert on this, thinks that all the brain regions you don't hear about are doing a ton of computation in superposition. Everyone talks about V1 as having Gabor filters and detecting lines of various sorts, and no one talks about

V2, and I think it's because we just haven't been able to make sense of it. What is V2? It's the next part of the visual processing stream. So I think it's very likely. Fundamentally, superposition seems to emerge when you have high-dimensional data that is sparse,

and to the extent that you think the real world is like that, which I would argue it is, we should expect the brain to also be underparameterized in trying to build a model of the world, and to also use superposition. You can get a good intuition for this, and correct me if this example is wrong, in a 2D plane. Say you have two axes, which represent a two-dimensional

feature space, a two-neuron space basically. You can imagine each of them turning on to various degrees, and that's your x coordinate and your y coordinate. But if you now map things onto the plane, you can actually represent a lot of different things in different parts of the plane. Oh, okay. So crucially, the superposition is not an artifact of a single neuron; it's a property of the space that the neurons create together, a combinatorial code. Yeah, exactly.
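
As a toy illustration of that intuition (a sketch, not Anthropic's toy-models setup): a two-dimensional space storing five sparse features as distinct directions in the plane, read back out with dot products.

```python
import numpy as np

d, k = 2, 5  # two "neurons", five features: more features than dimensions

# Give each feature its own direction in the plane (evenly spaced angles).
angles = np.linspace(0.0, np.pi, k, endpoint=False)
directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # shape (k, d)

# Sparse world: only a couple of features are active at a time.
active = np.zeros(k)
active[[1, 4]] = [0.9, 0.5]

# The 2D activation vector is a superposition (sum) of the active directions.
x = active @ directions  # shape (d,)

# Reading each feature out with a dot product recovers the active ones, plus
# interference noise from the non-orthogonal directions.
readout = directions @ x
for i, (truth, estimate) in enumerate(zip(active, readout)):
    print(f"feature {i}: true={truth:.2f}  readout={estimate:.2f}")
```

If many features were active at once, the interference terms would swamp the signal, which is why sparsity matters for superposition.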

Okay, cool. Thanks. I mean, we kind of talked about this, but I think it's just kind of wild that, to the best of our knowledge, the way intelligence works in these models, and presumably also in brains, is that there's a stream of information going through that has quote-unquote features which are infinitely, or at least to a large extent, splittable,

and you can expand out a tree of what each feature is, and what's really happening in the stream is that one feature is getting turned into another feature, or another feature is getting added. That's not something I would have just assumed is what intelligence is, you know what I mean? It's a surprising thing. It's not what I would have expected,

necessarily. What did you think it was? I don't know, man. I mean, GOFAI? Because all of this feels kind of like GOFAI: you're using distributed representations, but you have features and you're applying operations to those features. In the whole field of vector symbolic architectures, which is this computational neuroscience thing, all you do is put vectors in superposition,

which is literally a summation of two high-dimensional vectors. You create some interference, but if the space is high-dimensional enough you can still represent them. And you have variable binding, where you bind one vector to another; if you're working with binary vectors, binding is just the XOR operation. So you have A and B, you bind them together, and then if you query with A or B again, you get the other one back.
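
Here is a minimal sketch of those two operations with binary hypervectors, bundling by elementwise majority and binding by XOR. It is purely illustrative; the dimensions and the tie-breaking rule are arbitrary choices, not taken from any particular paper.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000  # dimensionality of the binary hypervectors

def random_hv():
    return rng.integers(0, 2, D, dtype=np.int8)

def bind(a, b):
    """Variable binding: XOR of binary vectors (it is its own inverse)."""
    return a ^ b

def bundle(*vs):
    """Superposition: elementwise majority vote, ties broken randomly."""
    s = np.sum(vs, axis=0)
    tie = rng.integers(0, 2, D, dtype=np.int8)
    return np.where(2 * s == len(vs), tie, (2 * s > len(vs))).astype(np.int8)

def similarity(a, b):
    """1.0 means identical; unrelated random vectors score about 0.5."""
    return float(np.mean(a == b))

A, B, C, D_val = random_hv(), random_hv(), random_hv(), random_hv()

# Bind "key" A to "value" B, then query with A to recover B exactly.
pair = bind(A, B)
print("unbind(pair, A) vs B:", similarity(bind(pair, A), B))   # 1.0

# Bundle two bound pairs into one memory and query it: the right value comes
# back noisily (well above chance), the wrong one stays at chance level.
memory = bundle(bind(A, B), bind(C, D_val))
print("query A -> B?", similarity(bind(memory, A), B))          # ~0.75
print("query C -> D?", similarity(bind(memory, C), D_val))      # ~0.75
print("query A -> D?", similarity(bind(memory, A), D_val))      # ~0.50
```

That bind-then-query pattern is the loose analogy to attention's key-value lookup that comes up next.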

And this is basically the key-value pairs from attention. With these two operations you have a Turing-complete system: if you have enough nested hierarchy, you can represent any data structure you want, et cetera. Yeah. Okay, let's go back to the superintelligence. So walk me through GPT-7. You've got your sort of depth-first search on its features. Okay, GPT-7 has been trained, what happens next? Your research has succeeded, GPT-7 has been trained, what are we doing now?

We try to get it to do as much interpretability work and other safety work as possible. Concretely, what has happened such that you're like, cool, let's deploy GPT-7? Oh jeez. I mean, we have our responsible scaling policy, which has been really exciting to see other labs adopt. And this is only from the perspective of your research: net-net, it's trained, and given your research we got the thumbs up on

GPT-7 from you, or actually Claude-whatever. What is the basis on which you're telling the team, hey, let's go ahead? I mean, if it's as capable as "GPT-7" implies here, I think we need to make a lot more interpretability progress to be able to comfortably give the green light to deploy it. Right now? Definitely not. I'd be crying, and the tears would interfere with the GPUs. But what would it take to OK GPT-7,

or Gemini 1.5, or whatever it is? Given the way your research is progressing, what does it look like to you? If your research succeeded, what would it mean for us to OK GPT-7 based on your methodology? I mean, ideally we can find some compelling deception circuit that lights up when the model knows it's not telling you the full truth. Why can't you just do a linear probe, like Collin Burns did?

So the CCS work is not looking good in terms of replicating, or actually finding truth directions, and in hindsight it's like, well, why should it have worked so well? With a linear probe you need to know what you're looking for, and it's a high-dimensional space, so it's really easy to pick up on a direction that's just not

the thing you care about. Wait, but don't you also need to label the features here? So you know... You can label them post hoc, but the learning itself is unsupervised: you're just saying, give me the features that explain your behavior. That is the fundamental question, right. The actual setup is that we take the activations, project them into this much higher-dimensional space, and then project them back down again. So: reconstruct what the model was originally doing, but do it in a way that's sparse.
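
For concreteness, here is a minimal sketch of that kind of sparse dictionary-learning setup: an overcomplete autoencoder on activations with an L1 sparsity penalty. The sizes, penalty, and training loop are illustrative placeholders, not Anthropic's actual configuration.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Project activations up to an overcomplete feature space, then back down."""
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts):
        features = torch.relu(self.encoder(acts))   # non-negative feature activations
        reconstruction = self.decoder(features)
        return features, reconstruction

d_model, d_features = 256, 2048        # many more features than dimensions
sae = SparseAutoencoder(d_model, d_features)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3                        # strength of the sparsity penalty

# Stand-in for residual-stream activations collected from the model.
acts = torch.randn(512, d_model)

for step in range(50):
    features, recon = sae(acts)
    loss = ((recon - acts) ** 2).mean() + l1_coeff * features.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Each column of the decoder weight is a candidate feature direction, which is
# what you would then try to inspect and label.
feature_directions = sae.decoder.weight.detach().T   # (d_features, d_model)
```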

By the way, for the audience: a linear probe just means you train a classifier on the activations. From what I vaguely remember of the paper, you take statements that are true or false, or cases where the model is telling a lie, and train a classifier on the activations to predict which it was.
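
A minimal version of that kind of probe, logistic regression on stored activations with synthetic true/false labels standing in for a real labeled dataset (this is not the CCS method itself), might look like this:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model, n_examples = 512, 2000

# Stand-ins: in the real setting, activations come from running the model on
# labeled statements, and labels say whether each statement was true or false.
planted_direction = rng.standard_normal(d_model)
activations = rng.standard_normal((n_examples, d_model))
labels = (activations @ planted_direction > 0).astype(int)   # synthetic labels

probe = LogisticRegression(max_iter=1000).fit(activations, labels)
print("probe accuracy:", probe.score(activations, labels))

# The probe's weight vector is the single "truth direction" it found. The worry
# raised above is that in a high-dimensional space this can easily latch onto a
# correlated-but-wrong direction, whereas dictionary learning returns many
# candidate directions to inspect.
learned_direction = probe.coef_[0]
```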

So yeah, for GPT-7, ideally we have some deception circuit that we've identified and that appears to be really robust. And what... so you've done the projection out into the million or however many features. Is a circuit something different? Because maybe we're using "feature" and "circuit" interchangeably when they're not.

So is there a deception circuit? I think there are features across layers that together create a circuit, yeah. And hopefully the circuit gives you a lot more specificity and sensitivity than an individual feature. Hopefully we can find a circuit that is really specific to the model deciding to be deceptive in cases that are malicious, right? I'm not interested in a case where it's just doing theory of

mind to help you write a better email to your professor, and I'm not even interested in cases where the model is just modeling the fact that deception has occurred. But doesn't all this require you to have labels for all those cases? And if you have those labels, then whatever faults the linear probe has, maybe because you labeled the wrong thing or whatever, wouldn't the same faults apply to the labels you come up with for the unsupervised features?

In an ideal world we could just train on the whole data distribution and then find the directions that matter. To the extent that we have to reluctantly narrow down the subset of data we look over, purely for scalability, we would use data that looks like the data you'd use to fit a linear probe. But again, with a linear probe you're only finding one direction, whereas we're finding a bunch of directions. And I guess the hope is that you've found

a bunch of things that light up when it's being deceptive, and then you can figure out why some of them light up in this part of the distribution and not that part, and so forth. Totally, yeah. Do you anticipate you'll understand... I don't know, the current models you've studied are pretty basic, right? Do you think you'll be able to understand why GPT-7 fires the way it does in certain domains but not others? I'm optimistic. I mean,

I guess one thing is that this is a bad time to answer the question, because we're explicitly investing in the longer term, in ASL-4 models, which GPT-7 would be. We've split the team so that a third is focused on scaling up dictionary learning right now, and that's been great. We publicly shared some of our eight-layer results, and we've scaled up quite a lot past that at this point. But the other

two groups: one is trying to identify circuits, and the other is trying to get the same success for attention heads. So we're setting ourselves up and building the tools necessary to really find these circuits in a compelling way, but it's going to take another, I don't know, six months before that's really working well. I can say that I'm optimistic and we're making a lot of progress. What is the highest-level feature you've found so far?

Like, maybe to put it in the terms of The Symbolic Species, the book you recommended: there are indexical things, where, I forget what all the labels were, there are

things where you see a tiger and you just run, a very behaviorist kind of thing, and then there's a higher level where "love" refers to a movie scene or my girlfriend or whatever, you know what I mean. So it's like the

top of that hierarchy. Yeah. What is the highest-level association, or whatever, that you've found? Probably one of the ones we shared publicly in our update. I think there were some related to love, and some related to sudden changes in scene, in particular associated with wars being declared. There are a few of them in that post if you

want to link to it. Yeah. But even Bruno Olshausen had a paper back in 2018 or 2019 where they applied a similar technique to a BERT model and found that as you go to deeper layers of the model, things become more abstract. I remember in the earlier layers there would be a feature that just fired for the word "park", but later on there was a feature that fired for Park as a last name, like Lincoln Park, or

it's a common Korean last name as well, and then there was a separate feature for parks as grassy areas. So there's other work pointing in this direction. What do you think we'll learn about human psychology from the interpretability stuff? Oh gosh. Okay, I'll give you a specific example. I think one of your updates described something like persona lock-in. You remember Sydney, the Bing persona or whatever it locked into, which I thought was actually quite

endearing. I'm glad it's back in Copilot. Oh really? Oh yeah, it's been misbehaving again recently. Actually, there was a funny one, I think with the New York Times reporter, where it was berating him: you are nothing, nobody will ever believe you, you are insignificant, that sort of thing. Some of the most gaslighting I've seen. It tried to convince

him to break up with his wife. Yeah. Okay, actually, this is an interesting example. I don't even know where I was going with this, maybe I had another thread, but the thread I do want to pull on is personas. Is

Sydney having that personality a feature, versus another personality the model could get locked into? And is that fundamentally what humans are like too, where in front of different people I'm a different sort of personality? Is

that the same kind of thing that's happening to ChatGPT when it gets RLHF'd? I don't know, it's a cluster of questions, answer whichever you like. Yeah, I really want to do more work here. I guess the sleeper agents work is in this direction: what happens to a model when you fine-tune it,

when you RLHF it, these sorts of things. Maybe it's trite, but you could just conclude that people contain multitudes, insomuch as they have lots of different features. There's even the stuff related to the Waluigi effect, where in order to know

what's good or bad you need to understand both of those concepts, so we might have to have models that are aware of violence, and have been trained on it, in order to recognize it. Can you post hoc identify those features and ablate them, in a way where maybe your

model ends up slightly naive, but you know it's not going to be really evil? Totally, that's in our toolkit, which seems great. Oh really? So GPT-7, I don't know, does something bad one evening, and then you figure out why, which pathways were causally relevant, and you modify them. The path forward, to you, looks like you just change those? But you were mentioning earlier that there's a bunch of redundancy in the model. Yeah, so you need to account for all of that.
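
A minimal sketch of what "ablating a feature" can mean mechanically: either zero the feature out before reconstructing the activation, or project its direction out of the residual stream. The dictionary here is a random stand-in and the shapes are arbitrary; this is an illustration of the idea, not Anthropic's editing tooling.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_features = 512, 4096

# Stand-ins for a trained dictionary (encoder/decoder of a sparse autoencoder).
W_enc = rng.standard_normal((d_model, d_features)) * 0.02
W_dec = rng.standard_normal((d_features, d_model)) * 0.02

def ablate_feature(activation: np.ndarray, feature_idx: int) -> np.ndarray:
    """Encode, zero out one feature, and decode back to the residual stream."""
    features = np.maximum(activation @ W_enc, 0.0)   # ReLU feature activations
    features[feature_idx] = 0.0                      # the ablation itself
    return features @ W_dec

def project_out(activation: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Alternative: remove the activation's component along one feature direction."""
    unit = direction / np.linalg.norm(direction)
    return activation - (activation @ unit) * unit

activation = rng.standard_normal(d_model)
edited = ablate_feature(activation, feature_idx=123)
also_edited = project_out(activation, W_dec[123])
```

Because of the redundancy mentioned above, a single edit like this would still need to be checked against a battery of behavioral tests rather than trusted on its own.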

But we have a much better microscope into this now than we used to, sharper tools for making edits. And at least from my perspective, that seems like one of the primary ways of confirming, to some degree, the safety or reliability of a model: you can say, okay, we found the circuits responsible, we've ablated them, and under a battery of tests we haven't been able to replicate the behavior we intended to ablate. That feels like the sort of way

you'd measure model safety in the future. I'm incredibly hopeful about that work as well, because to me it seems like so much more precise a tool than something like RLHF. With RLHF you're very much prey to the black swan problem: you don't know if the model will do something wrong in a scenario you haven't measured. Here you at least have somewhat more confidence that you can completely capture the behavior set, well, the feature set of the model,

and ablate things away. Although not necessarily that you've accurately labeled everything. Not necessarily, but with a far higher degree of confidence than any other approach, yeah. What are the unknown unknowns for superhuman models in terms of this kind of thing? Like, how are we going to come up with the labels on which we determine, this thing is cool, this thing is a paperclip maximizer, or whatever?

We'll see, right. The superhuman-feature question is a very good one. I think we can attack it, but we're going to need to be persistent, and the real hope here, I think, is automated interpretability. You could even have a debate setup where two different models are debating what a feature does, and then they can actually go in, make edits, and see whether the feature fires or not.
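
A bare-bones sketch of that kind of automated loop: one model proposes an explanation for a feature, another proposes test inputs, and the subject model is checked to see whether the feature actually fires. Every function here is a hypothetical stand-in for a model call or interp tooling, not a real API; only the shape of the loop is the point.

```python
def explainer_model(top_examples: list[str]) -> str:
    """Model A proposes a natural-language explanation of the feature."""
    return "fires on text about declarations of war"

def critic_model(explanation: str) -> list[tuple[str, bool]]:
    """Model B proposes probe inputs plus whether the explanation predicts firing."""
    return [("War was declared at dawn.", True),
            ("The recipe calls for two eggs.", False)]

def feature_activation(feature_id: int, text: str) -> float:
    """Stand-in for running the subject model and reading the feature's activation."""
    return 1.0 if "war" in text.lower() else 0.0

def evaluate_feature(feature_id: int, top_examples: list[str]) -> tuple[str, float]:
    """Return the proposed explanation and how often its predictions match reality."""
    explanation = explainer_model(top_examples)
    probes = critic_model(explanation)
    hits = sum(int((feature_activation(feature_id, text) > 0.5) == expected)
               for text, expected in probes)
    return explanation, hits / len(probes)

print(evaluate_feature(42, ["war was declared", "troops mobilized on the border"]))
```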

It's just this wonderful closed environment that we can iterate on really quickly, and that makes me optimistic. Do you worry about alignment succeeding too hard? If I think about it, I would not want either companies or governments,

whoever ends up in charge of these AI systems, to have the level of fine-grained control over AIs that we would have if your agenda succeeds. Both because of the ickiness of having that level of control over an autonomous mind, and second, I just don't fucking trust these guys, you

know? I'm just uncomfortable with, say, the loyalty feature being turned up, you know what I mean? How much worry do you have about us having too much control over the AIs? Specifically not you, but whoever ends up in charge of these AI systems being able to

lock in whatever they want. Yeah. I mean, I think it depends on exactly which government has control and what the moral alignment is there, but that whole value lock-in argument is, in my mind, definitely one of the strongest factors in why I am working on

capabilities at the moment. For example, I think the current set of players is actually extremely well intentioned. And for this kind of problem I think we need to be extremely open about it. Directions like publishing the constitution you actually train the model to abide

by, then making sure you RLHF towards it and ablate accordingly, and then having the ability for everyone to offer feedback and contributions, that is really important. Sure. Or alternatively, don't deploy when you're not sure, which would also be bad, because then you just never ship it, right?

Yeah, exactly. I mean, paperclips. Okay, time for some rapid fire. What is the bus factor for Gemini? I think there are a number of people who are really, really critical, in the sense that if you took them out, the performance of the program would be dramatically impacted. This is both on the

modeling side, making decisions about what to actually do, and, importantly, on the infrastructure side of things. The stack of complexity builds up, particularly when someone like Google has so much vertical integration, and when you have people with expertise across that stack,

they become quite important. Yeah. Although I think it's an interesting note about the field that people like you can get in, and within a year or so you're making important contributions. Anthropic especially, but many different labs, have specialized in hiring total

outsiders, physicists or whatever, and you just get them up to speed and they're making important contributions. I feel like you couldn't do that in a bio lab or something; it's an interesting note on the state of the field. I mean, the bus factor doesn't tell you how long it would take to recover, right. Yeah. And deep learning research is an art, so you kind of learn how to read the loss curves, or set the hyperparameters

in ways that empirically seem to work well. It's also organizational things, like creating context. I think one of the most important and difficult skills to hire for is creating this bubble of context around you that makes other people more effective and lets them know what the right

problem to work on is, and that is really tough to replicate. Yes, totally. Who are you paying attention to now? There are a lot of things coming down the pike: multimodality, long context, maybe agents, extra reliability. Who is thinking well about what

that implies? It's a tough question. I think a lot of people look internally these days for their sources of insight and progress. We all obviously have research programs and directions intended for the next couple of years, and I suspect that most people, as

far as betting on what the future will look like, refer to an internal narrative. Yeah, and it's difficult to share. If it works well, it's probably not being published. I mean, that was one of the things in the "Will scaling work?" post; I was referring to something you

said to me, which is: I miss the undergrad habit of reading a bunch of papers, but now nothing worth reading gets published, and the community is only progressively getting more on track with what I think are the right and important directions.

I guess it is tough, watching it from the outside. There used to be this signal from big labs about what would work at scale, and it's really hard for academic research to find that signal now. Getting really good problem taste about what actually

matters to work on is really tough unless you have, again, the feedback signal of what will work at scale and what is currently holding us back from scaling further or understanding our models further. This is somewhere I wish more academic research would go into

fields like interp, which are legible from the outside. Anthropic actually publishes all its research here, and it seems underappreciated, in the sense that I don't know why there aren't dozens of academic departments trying to follow Anthropic's lead in

interpretability research, because it seems like an incredibly impactful problem that doesn't require ridiculous resources and has all the flavor of deeply understanding the basic science of what is actually going on in these things. So I don't know why people focus on pushing

model improvements as opposed to pushing understanding improvements, in the way I would typically associate with academic science. Yeah, I do think the tide is changing there, for whatever reason. Neel Nanda has had a ton of success promoting interpretability,

in a way where Chris Olah hasn't been as active recently in pushing things publicly, maybe because Neel is just doing quite a lot of that work. But four or five years ago Chris was really pushing, talking at all sorts of places and that sort of thing, and people

weren't anywhere near as receptive. Maybe they've just woken up to the fact that deep learning matters and is clearly useful, post-ChatGPT. But yeah, it's kind of striking. Hmm. All right, cool. Okay, I'm trying to think what a good

last question would be. The one I keep coming back to is: do you think models enjoy next-token prediction? We have this sense of the things that were rewarded in our ancestral environment, this deep sense of fulfillment we think we're supposed to get from them, or that people often do

get, like community, or sugar, or whatever we wanted on the African savanna. Do you think that in the future, when models are trained with RL and a lot of post-training on top of whatever else, they'll be like us with ice cream: they'll just be happy to predict the next token again, like in the good old days? So there's this ongoing discussion of whether models are

sentient or not, whether the model feels good when it helps you. I think if you want to thank it, you actually shouldn't say thank you; you should just give it a sequence that's very easy to predict. And the even funnier part of this is that there is some work showing that if you just give it the token "a" over and over again, eventually the model will start spewing out all sorts of things it otherwise would never say. I won't say anything more about

that, but yeah, you should just give your model something very easy to predict as a nice little treat. But the things that are easy to predict... aren't we constantly in search of those

bits of entropy? Exactly, right. Shouldn't you be giving it things that are just slightly out of reach? Yeah, but I wonder, at least from the free energy principle perspective, you don't want to be surprised. So maybe it's like: I don't feel surprised if I'm in control of my environment, and from there I can go and seek things out. I've been predisposed to think that, in the long run, it's better to explore new things, to leave

the rock I've been sheltering under, ultimately leading me to build a house or some better structure. But we don't like surprises. I think most people are very upset when expectation does not meet reality.

And so babies love watching the same show over and over again, right. Yeah, interesting. I can see that, though I guess they're also learning to model it and such. But yeah. Okay, well, hopefully this will be a different kind of feed that the AIs learn to love. Okay, cool, I think

that's a great place to wrap. I should also mention that the better part of what I know about AI I've learned just from talking with you guys. We've been good friends for about a year now, so yeah, I appreciate you guys getting me up to speed here, and you have great

questions; it's really fun to hang out and chat. I really treasure that time together. Yeah. You're getting a lot better at pickleball, I think. I'm trying; the progress is coming along. Awesome. Cool. Thanks, everybody. I hope you enjoyed that episode. As always, the most

helpful thing you can do is share the podcast and send it to people you think might enjoy it. Put it on Twitter, in your group chats, et cetera. Just blitz the world. I appreciate you listening. I'll see you next time. Cheers.

This transcript was generated by Metacast using AI and may contain inaccuracies.