
Dario Amodei (Anthropic CEO) - Scaling, Alignment, & AI Progress

Aug 08, 2023 · 2 hr 59 min

Episode description

Here is my conversation with Dario Amodei, CEO of Anthropic.

Dario is hilarious and has fascinating takes on what these models are doing, why they scale so well, and what it will take to align them.

Watch on YouTube. Listen on Apple Podcasts, Spotify, or any other podcast platform. Read the full transcript here. Follow me on Twitter for updates on future episodes.

Timestamps

(00:00:00) - Introduction

(00:01:00) - Scaling

(00:15:46) - Language

(00:22:58) - Economic Usefulness

(00:38:05) - Bioterrorism

(00:43:35) - Cybersecurity

(00:47:19) - Alignment & mechanistic interpretability

(00:57:43) - Does alignment research require scale?

(01:05:30) - Misuse vs misalignment

(01:09:06) - What if AI goes well?

(01:11:05) - China

(01:15:11) - How to think about alignment

(01:31:31) - Is modern security good enough?

(01:36:09) - Inefficiencies in training

(01:45:53) - Anthropic’s Long Term Benefit Trust

(01:51:18) - Is Claude conscious?

(01:56:14) - Keeping a low profile



Get full access to Dwarkesh Podcast at www.dwarkesh.com/subscribe

Transcript

A generally well-educated human — that could happen in, you know, two or three years.

What does that imply for Anthropic when, in two to three years, these Leviathans are doing like $10 billion training runs?

The models, they just want to learn. It was a bit like a Zen koan. I listened to this and I became enlightened.

The compute doesn't flow, like the spice doesn't flow.

Yeah, it's like the blob has to be unencumbered, right?

The big acceleration that happened late last year and the beginning of this year — we didn't cause that. And honestly, I think if you look at the reaction of Google, that might be ten times more important than anything else.

There was a running joke that the way building AGI would look is, you know, there would be a data center next to a nuclear power plant next to a bunker.

But now it's 2030 — what happens next? What are we doing with a superhuman god?

Okay, today I have the pleasure of speaking with Dario Amodei, who is the CEO of Anthropic, and I'm really excited about this one. Dario, thank you so much for coming on the podcast.

Thanks for having me.

First question: you have been one of the very few people who has seen scaling coming for years — more than five years, I don't know how long it's been.

As somebody who's seen it coming, what is fundamentally the explanation for why scaling works? Why is the universe organized such that if you throw big blobs of compute at a wide enough distribution of data, the thing becomes intelligent?

I think the truth is that we still don't know. I think it's almost entirely an empirical fact — a fact that you could kind of sense from the data and from a bunch of different places — but I think we still don't have a satisfying explanation for it. If I were to try to make one — and I'm just kind of waving my hands when I say this — there are these ideas in physics around long tails or power laws of correlations or effects. When a bunch of stuff happens, when you have a bunch of features, you get a lot of the data in the early, fat part of the distribution before the tails. For language this would be things like: oh, I figured out there are parts of speech, and nouns follow verbs, and then there are these more and more and more subtle correlations. So it kind of makes sense why, with every log or order of magnitude that you add, you capture more of the distribution. What's not clear at all is why it scales so smoothly with parameters, and why it scales so smoothly with the amount of data. You can think up some explanations of why it's linear — like the parameters are a bucket and the data is water, and so the size of the bucket is proportional to the size of the water — but why does it lead to all this very smooth scaling? I think we still don't know. There are all these explanations — our chief scientist Jared Kaplan did some stuff on fractal manifold dimension that you can use to explain it — so there are all kinds of ideas, but I feel like we just don't really know for sure.

And by the way, for the audience who's trying to follow along: by scaling, we're referring to the fact that you can very predictably see how to go from GPT-3 to GPT-4, or in this case Claude 1 to Claude 2 — that the loss, in terms of whether it can predict the next token, scales very smoothly.
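As an aside for readers trying to follow the "scales very smoothly" point: here is a minimal, purely illustrative sketch of the kind of power-law loss curve being described. The functional form (a power law in parameter count plus an irreducible-entropy floor) follows the shape reported in published scaling-law work; the specific constants are invented for illustration and are not Anthropic's numbers.

```python
# Purely illustrative sketch of a smooth scaling curve of the kind being
# described: loss falls as a power law in parameter count, approaching an
# irreducible entropy floor. All constants below are invented, not measured.

L_INF = 1.7      # assumed irreducible entropy of text, nats/token
N_C   = 8.8e13   # assumed scale constant
ALPHA = 0.076    # assumed power-law exponent

def predicted_loss(n_params: float) -> float:
    """Toy scaling law: smooth, predictable loss as a function of model size."""
    return L_INF + (N_C / n_params) ** ALPHA

for n in [1e8, 1e9, 1e10, 1e11, 1e12]:
    print(f"{n:.0e} params -> loss {predicted_loss(n):.3f} nats/token")
```

The point of the toy curve is just that each order of magnitude of scale buys a predictable, gradually shrinking reduction in loss, with no obvious jumps.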

Okay, so we don't know why it's happening — but can you at least predict empirically: here is the loss at which this ability will emerge, here is the place where this circuit will emerge? Is that how it operates, or are you just looking at the loss?

That specific stuff is much less predictable. What's predictable is this statistical average, this loss, this entropy — and it's super predictable, sometimes predictable to several significant figures, which you don't see outside of physics. You don't expect to see it in this messy empirical field. But specific abilities are very hard to predict. Back when I was working on GPT-2 and GPT-3: when does arithmetic come into place, when do models learn to code? Sometimes it's very abrupt. It's kind of like you can predict statistical averages of the weather, but the weather on one particular day is very, very hard to predict.

So — I don't understand manifolds, but mechanistically: it doesn't know addition yet, and now it knows addition. What has happened?

This is another question that we don't know the answer to. We're trying to answer it with things like mechanistic interpretability, but I'm not sure. You can think about these things as circuits snapping into place, although there is some evidence that, when you look at the models being able to add things, if you look at the chance of getting the right answer, that shoots up all of a sudden — but if you look at, okay, what's the probability it assigns to the right answer, you'll see it climb from one in a million, to one in a hundred thousand, to one in a thousand, long before it actually gets the right answer. So in many of these cases at least — I don't know if in all of them — there's some continuous process going on behind the scenes. I don't understand it at all.

Does that imply that the circuit or the process for doing addition was pre-existing and it just got increased in scale?

I don't know if there's this circuit that's weak and getting stronger, or if it's something that works but not very well. I think we don't know, and these are some of the questions we're trying to answer with mechanistic interpretability.
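A toy numerical sketch (not from the conversation) of the point being made here: if an answer takes several tokens to state and per-token accuracy improves smoothly with scale, the probability assigned to the whole answer climbs steadily on a log scale long before exact-match accuracy looks like it "snaps into place." Every number below is invented for illustration.

```python
# Toy illustration of why an ability can look like it suddenly appears:
# exact-match accuracy on a multi-token answer jumps abruptly, while the
# (log-scale) probability assigned to the full answer climbs smoothly.
# Both the answer length and the per-token accuracy curve are invented.
import math

ANSWER_LEN = 8   # assume the correct answer takes 8 tokens to state

for n_params in [1e8, 3e8, 1e9, 3e9, 1e10, 3e10, 1e11]:
    # smooth, unremarkable improvement in per-token accuracy with scale
    p_token = min(0.99, 0.2 * (n_params / 1e8) ** 0.218)
    # chance of emitting the entire answer exactly right
    p_answer = p_token ** ANSWER_LEN
    print(f"{n_params:.0e} params: log10 P(answer) = {math.log10(p_answer):6.2f}, "
          f"exact match = {p_answer:.2%}")
```

On this toy curve, the log-probability of the answer rises gradually from roughly one in a few hundred thousand, while exact-match accuracy stays near zero for a long time and then shoots up.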

Are there abilities that won't emerge with scale?

I definitely think that things like alignment and values are not guaranteed to emerge with scale. One way to think about it is: you train the model, and it's basically predicting the world, understanding the world. Its job is facts, not values. It's trying to predict what comes next. But there are free variables here: what should you do, what should you think, what should you value? There just aren't the bits for that. There's just: well, if I started with this, I should finish with this; if I started with this other thing, I should finish with this other thing. And so I think that's not going to emerge.

I want to talk about alignment in a second, but on scaling: if it turns out that scaling plateaus before we reach human-level intelligence, looking back on it, what would be your explanation? What do you think is likely to be the case if that turns out to be the outcome?

I would distinguish some problem with the fundamental theory from some practical issue. One practical issue we could have is that we run out of data, for various reasons. I think that's not going to happen, but if you look at it very, very naively, we're not that far from running out of data — so it could be that we just don't have the data to continue the scaling curves. Another way it could happen is: we just use up all of the compute that was available, and that wasn't enough, and progress is slow after that. I wouldn't bet on either of those things happening, but they could. From a fundamental perspective, I personally think it's very unlikely that the scaling laws will just stop. If they do, another reason — again, this isn't fully fundamental — could be that we don't have quite the right architecture. Like, if we tried to do it with an LSTM or an RNN, the slope would be different. It still might be that we get there, but I think there are some things that are just very hard to represent when you don't have the ability to attend far into the past that transformers have. If somehow — and I don't know how we would know this — it wasn't about the architecture and we just hit a wall, I'd be very surprised by that. I think we're already at the point where the things the models can't do don't seem to me to be different in kind from the things they can do. You could have made a case a few years ago — they can't reason, they can't program — you could have drawn boundaries and said, well, maybe we'll hit a wall. I didn't think we would hit a wall, and a few other people didn't think so either, but it was a more plausible case then than I think it is now. Now, it could happen — this stuff is crazy.

It could happen tomorrow that we just hit a wall. If that happens, I'm trying to think of what would really be my explanation — it's unlikely, but what would really be my explanation? I think it would be that there's something wrong with the loss when you train on next-word prediction — that some of the remaining reasoning abilities, or something like that, aren't captured by it. If you really want to learn to program at a really high level, it means you care about some tokens much more than others, and they're rare enough that the loss function over-focuses on the things that are responsible for the most bits of entropy, and doesn't focus on the stuff that's really essential. You could kind of have the signal drowned out in the noise. I don't think it's going to play out that way, for a number of reasons, but if you told me, yep, you trained your 2024 model, it was much bigger and it just wasn't any better, and you tried every architecture and it didn't work — that's the explanation I would reach for.
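A toy sketch (with invented numbers) of the failure mode being hypothesized here: when only a small fraction of tokens carry the reasoning you care about, even a large improvement on exactly those tokens barely moves the average next-token loss.

```python
# Toy sketch of the "signal drowned out in the noise" worry: if only a small
# fraction of tokens carry the reasoning you care about, a big improvement on
# exactly those tokens barely moves the average next-token loss.
# All fractions and per-token losses below are invented for illustration.

FRAC_CRITICAL = 0.01   # assumed share of tokens that are reasoning-critical
LOSS_COMMON   = 1.8    # assumed loss (nats/token) on ordinary tokens

def average_loss(loss_on_critical_tokens: float) -> float:
    """Average cross-entropy over the mixture of ordinary and critical tokens."""
    return (1 - FRAC_CRITICAL) * LOSS_COMMON + FRAC_CRITICAL * loss_on_critical_tokens

before = average_loss(6.0)   # model is bad at the rare, important tokens
after  = average_loss(2.0)   # model gets dramatically better at them
print(f"average loss: {before:.3f} -> {after:.3f} (change of only {before - after:.3f} nats)")
```

In this toy mixture, a dramatic improvement on the rare tokens shows up as a barely visible change in the aggregate loss.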

Is there a candidate for another loss function, if you had to abandon next-token prediction?

I think then you would have to go for some kind of RL. And there are many different kinds: there's RL from human feedback, there's RL against an objective, there are things like Constitutional AI, there are things like amplification and debate. These are kind of both alignment methods and ways of training models. You would have to try a bunch of things, but the focus would have to be on what we actually care about the model doing. In a sense, we're a little bit lucky that "predict the next word" gets us all these other things we need. There's no guarantee.

It seems like, from your worldview, there's a multitude of different loss functions — it's just a matter of which ones allow you to throw a whole bunch of data at them. The next-token prediction itself is not significant.

Well, I guess the thing with RL is that you get slowed down a bit, because you have to, by some method, design how the loss function works. The nice thing with next-token prediction is that it's there for you. It's just there; it's the easiest thing in the world. So I think it would slow you down if you couldn't scale in just that very simple way.

You mentioned that the data is likely not to be the constraint. Why do you think that is the case?

There are various possibilities here, and for a number of reasons I shouldn't go into the details, but there are many sources of data in the world, and there are many ways that you can also generate data. My guess is that this will not be a blocker. Maybe it would be better if it was, but it won't be.

Are you talking about multimodal?

There are just many different ways to do it.

How did you form your views on scaling? How far back can we go where you would have been saying something basically similar to this?

This view I probably formed gradually from, I would say, 2014 to 2017. My first experience with it was my first experience with AI. I saw some of the early stuff around AlexNet in 2012. I had always kind of wanted to study intelligence, but before that I was just like, this isn't really working — it doesn't seem like it's actually working. Going all the way back to, like, 2005, I'd read Ray Kurzweil's work, I'd even read some of Eliezer's work on the early internet. Back then I was like, oh, this stuff kind of looks far away — I look at the AI stuff of today and it's not anywhere close. But with AlexNet I was like, oh, this is actually starting to work. So I joined Andrew Ng's group at Baidu, and the first task I got set to do — I'd been in a different field, so this was my first experience with AI — was a bit different from a lot of the academic-style research that was going on elsewhere in the world. I think I kind of got lucky, in that the task given to me and the other folks there was just: make the best speech recognition system that you can. There was a lot of data available, there were a lot of GPUs available, so it posed the problem in a way that was amenable to discovering that scaling was a solution. That's very different from being a postdoc whose job is to come up with an idea that seems clever and new and marks you as someone who's invented something. And so I just quickly discovered it by trying the simplest experiments — just fiddling with some dials. Okay, try adding more layers to the RNN, try training it for longer. What happens? How long does it take to overfit? What if I add new data and repeat it fewer times? I just saw these very consistent patterns.

I didn't really know this was unusual, or that others weren't thinking in this way. It was almost like beginner's luck — it was my first experience with it, and I didn't really think about it beyond speech recognition. I was just kind of like, oh, I don't know anything about this field, there are a zillion things people do with machine learning, but weirdly, this seems to be true in the speech recognition field. And then, just before OpenAI started, I met Ilya, who you interviewed. One of the first things he said to me was: look, the models, they just want to learn. You have to understand this — the models just want to learn. It was a bit like a Zen koan. I listened to this and I became enlightened. Over the years after this — again, I would be kind of the one who would formalize a lot of these things and put them together — what that told me is that the phenomenon I'd seen wasn't just some random thing I'd seen. It was broad, it was more general. The models just want to learn. You get the obstacles out of their way. You give them good data, you give them enough space to operate in, you don't do something stupid like conditioning them badly numerically — and they want to learn. They'll do it.

What I found really interesting about what you said is that there were many people who were aware back at that time — maybe not working on it directly, but aware — that these things are really good at speech recognition, or at playing these constrained games. Very few extrapolated from there, like you and Ilya did, to something that is generally intelligent. What was different about the way you were thinking about it versus how others were, such that you went from "it's getting better at speech in this consistent way" to "it will get better at everything in this consistent way"?

I genuinely don't know. At first, when I saw it for speech, I assumed this was just true for speech, or for this narrow class of models. It was just over the period between 2014 and 2017 that I tried it for a lot of things and saw the same thing over and over again. I watched the same thing being true with Dota. I watched the same thing being true with robotics, which many people thought of as a counterexample — but I just thought, it's hard to get data for robotics, and if we look within the data that we have, we see the same patterns. So I don't know. I think people were very focused on solving the problem in front of them. Why one person thinks one way and another person thinks another is very hard to explain. I think people just see it through a different lens — they're looking vertically instead of horizontally. They're not thinking about the scaling; they're thinking about how do I solve my problem, and, well, for robotics there's not enough data. That can easily get abstracted to "scaling doesn't work, because we don't have the data." So I don't know — for some reason, and it may just have been random chance, I was obsessed with that particular direction.

When did it become obvious to you that language is the means to just feed a bunch of data into these things? Or was it just that you ran out of other things — robotics, there's not enough data; this other thing, there's not enough data?

I think it was this whole idea of next-word prediction, that you could do self-supervised learning, together with the idea that, wow, for predicting the next word there's so much richness and structure there. It might say "two plus two equals" and you have to know the answer is four. It might be telling a story about a character, and then basically it's posing to the model the equivalent of these developmental tests that get posed to children: Mary walks into the room and puts an item somewhere, then Chuck walks into the room and moves the item and Mary doesn't see it — what does Mary think? The models are going to have to get this right in the service of predicting the next word. They're going to have to solve all these theory-of-mind problems, solve all these math problems. So my thinking was just: well, scale it up as much as you can; there's kind of no limit to it. I had that view abstractly, but the thing that really solidified it and convinced me was the work that Alec Radford did on GPT-1, which was that not only could you get this language model to predict things very well, but you could also fine-tune it — you needed to fine-tune it in those days — to do all these other tasks. And so I was like, wow, this isn't just some narrow thing where you get the language model right; it's sort of halfway to everywhere. You get the language model right, and then with a little move in this direction it can solve this logical coreference test or whatever, and with this other thing it can solve translation or something. And then you're like, wow, I think there's really something to this — and of course we can really scale it up.

One thing that's confusing, or that would have been hard to see: if you had told me in 2018 that we'd have models in 2023 like Claude 2 that can explain theorems in the style of Shakespeare — whatever theorem you want — that can ace standardized tests with open-ended questions, all kinds of really impressive things, I would have said at that time, oh, you have AGI. You clearly have something that is a human-level intelligence. Yet while these things are impressive, it clearly seems we're not at human level, at least in the current generation, and potentially for generations to come. What explains this discrepancy between super impressive performance on these benchmarks, and on the things you can describe, versus general intelligence?

Yeah, that was one area where I actually was not prescient, and I was surprised as well. When I first looked at GPT-3, and even more so at the kinds of things we built in the early days at Anthropic, my general sense was: it seems like these really grasp the essence of language. I wasn't sure how much we needed to just scale them up — maybe what's more needed from here is RL and all the other stuff. I thought in 2020: we can scale this a bunch more, but I wonder if it's more efficient to scale it more, or to start adding on these other objectives like RL. I thought maybe if you do as much RL as you've done pre-training for a 2020-style model, that's the way to go, and scaling it up will keep working — but is that really the best path? And I don't know, it just keeps going. I thought it had understood a lot of the essence of language, but there's further to go. Stepping back from it, one of the reasons I'm sort of very empiricist — about AI, about safety, about organizations — is that you often get surprised. I feel like I've been right about some things, but with these theoretical pictures I've still been wrong about most things. Being right about 10% of the stuff sets you head and shoulders above many people.

If you look back — I can't remember who it was — someone made these diagrams of, here's the village idiot, here's Einstein, here's the scale of intelligence, and the village idiot and Einstein are very close to each other. Maybe that's still true in some abstract sense, but it's not really what we're seeing, is it? It seems like the human range is pretty broad, and we don't hit the human range in the same place, or at the same time, for different tasks. Like: write a sonnet in the style of Cormac McCarthy or something — I don't know, I'm not very creative, so I couldn't do that, but that's a pretty high-level human skill, and the models are starting to get good at that kind of stuff. Or constrained writing — write a page about X without using the letter E — I think the models might be superhuman, or close to superhuman, at that. But when it comes to, I don't know, proving relatively simple mathematical theorems, they're just starting to do the beginnings of it. They make really dumb mistakes sometimes, and they really lack any kind of broad ability to correct their errors or do some extended task. So it turns out that intelligence isn't a spectrum. There are a bunch of different areas of domain expertise, a bunch of different kinds of skills — memory is different. It's all formed in the blob; it's not complicated. But to the extent it even is a spectrum, the spectrum is also wide. If you'd asked me ten years ago, that's not what I would have expected at all, but I think that's very much the way it's turned out.

Oh man, I have so many questions. Just as a follow-up on that one: do you expect that, given the distribution of training these models get from massive amounts of internet data versus what humans got from evolution, the repertoire of skills that elicits will be just barely overlapping, or will it be like concentric circles? How do you think about that?

Clearly there's a large amount of overlap, because a lot of these models have business applications, and many of their business applications are helping humans be more effective at things. So the overlap is quite large, and if you think of all the activity that humans put on the internet in text, that covers a lot of it. But it probably doesn't cover some things. The models, I think, do learn a physical model of the world to some extent, but they certainly don't learn how to actually move around in the world — maybe that's easy to fine-tune, but I think there are some things the models don't learn that humans do. And then the models learn, for example, to speak fluent Base64. I don't know about you, but I never learned that.
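For readers unfamiliar with it, a quick runnable illustration of what "speaking Base64" refers to: Base64 is an encoding of ordinary text that appears all over web data and that humans essentially never learn to read directly. The example sentence here is just an arbitrary choice.

```python
# What "speaking Base64" looks like: ordinary text re-encoded with a scheme
# humans never read directly, but which shows up constantly in web data.
import base64

message = "The models just want to learn."
encoded = base64.b64encode(message.encode("utf-8")).decode("ascii")
print(encoded)                                    # VGhlIG1vZGVscyBqdXN0IHdhbnQgdG8gbGVhcm4u
print(base64.b64decode(encoded).decode("utf-8"))  # round-trips back to the message
```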

How likely do you think it is that these models will be superhuman for many years at economically valuable tasks while they're still below humans at many other relevant tasks, in a way that prevents something like an intelligence explosion?

This kind of stuff is really hard to know, so I'll give that caveat: again, the basic scaling laws you can kind of predict, but this more granular stuff, which we really want to know to understand how this is all going to go, is much harder. My guess would be that the scaling laws are going to continue — subject to, you know, whether people slow down for safety or for regulatory reasons. But let's just put all that aside and say we have the economic capability to keep scaling. If we did that, what would happen? My view is we're going to keep getting better across the board, and I don't see any area where the models are super weak or not starting to make progress. That used to be true of math and programming, but over the last six months the 2023 generation of models, compared to the 2022 generation, has started to learn those. There may be more subtle things we don't know, so I kind of suspect that, even if it isn't quite even, the rising tide will lift all the boats.

Does that include the thing you were mentioning earlier, where on an extended task it kind of loses its train of thought, or its ability to just execute a series of steps?

I think that's going to depend on things like RL training to have the model do longer-horizon tasks, and I don't expect that to require a substantial amount of additional compute. I think that was probably an artifact of thinking about RL in the wrong way and underestimating how much the model had learned on its own. In terms of whether we're going to be superhuman in some areas and not others: I think it's complicated. I could imagine we won't be superhuman in some areas because, for example, they involve embodiment in the physical world. And then it's like, what happens? Do the AIs help us train faster AIs, and those faster AIs wrap around and solve that? Do you not need the physical world? It depends what you mean. Are we worried about an alignment disaster? Are we worried about misuse, like making weapons of mass destruction? Are we worried about the AI taking over research from humans? Are we worried about it reaching some threshold of economic productivity where it can do what the average human does? These different thresholds, I think, have different answers, though I suspect they will all come within a few years.

Let me ask about those thresholds. If Claude were an employee at Anthropic, what salary would it be worth? Is it meaningfully speeding up AI progress?

It feels to me like an intern in most areas, but then there are some specific areas where it's better than that. One thing that makes the comparison hard is that the form factor is not the same as a human's. If a human were to behave like one of these chatbots — I mean, I guess we could have this conversation — but they're more designed to answer single questions, or a few questions. They don't have the concept of a long life of prior experience. We're talking here about things I've experienced in the past, and chatbots don't have that. So there's all kinds of stuff missing, and it's hard to make the comparison. But they feel like interns in some areas, and then they have areas where they spike and are real savants, where they may be better than anyone here.

But does the overall picture of something like an intelligence explosion — my former guest Carl Shulman has this very detailed model of an intelligence explosion — does that make sense to you, as somebody who would actually see it happening, as they go from interns to entry-level software engineers, and those entry-level software engineers increase your productivity?

I think the idea that the AI systems become more productive — first they speed up the productivity of humans, then they kind of equal the productivity of humans, and then in some meaningful sense they're the main contributor to scientific progress — that happens at some point. That basic logic seems likely to me, although I have a suspicion that when we actually get into the details it's going to be weird and different than we expect — that all the detailed models are thinking about the wrong things, or right about one thing and then wrong about ten other things. So I don't know, I think we might end up in a weirder world than we expect.

When you add all this together, what's your estimate of when we get something kind of human-level? What does that look like?

Again, it depends on the thresholds. In terms of someone looking at the model, even talking to it for an hour or so, and it's basically like a generally well-educated human — that could be not very far away at all. I think that could happen in, you know, two or three years. The main thing that would stop it would be if we hit certain safety thresholds — and we have internal tests for that kind of thing — so if a company or the industry decides to slow down, or we're able to get governments to institute restrictions that moderate the rate of progress for safety reasons, that would be the main reason it wouldn't happen. But if you just look at the logistical and economic ability to scale, I don't think we're very far at all from that. Now, that may not be the threshold where the models are existentially dangerous — in fact, I suspect it's not quite there yet. It may not be the threshold where the models can take over most AI research. It may not be the threshold where the models seriously change how the economy works. I think it gets a little murky after that, and all those thresholds may happen at various times after that. But in terms of the base technical capability of it sounding like a reasonably well-educated human across the board, I think that could be quite close.

Why would it be the case that it could pass a Turing test for an educated person, but not be able to contribute to, or substitute for, human involvement in the economy?

A couple of reasons. One is just that the threshold of skill isn't high enough — comparative advantage. It doesn't matter that I have something that's better than the average human at every task; what I really need, for AI research, is to find something that is strong enough to substantially accelerate the labor of the thousand experts who are best at it. So we might reach a point where the comparative advantage of these systems is not great. Another thing that could be the case is that there are these kind of mysterious frictions that don't show up in naive economic models, but that you see whenever you go to a customer and say: hey, I have this cool chatbot — in principle it can do everything your customer service bot does, or that this part of your company does. But the actual friction of how do we slot it in, how do we make it work — that includes both the question of how it works in a human sense within the company, how things happen in the economy and overcome frictions, and also just what the workflow is, how you actually interact with it. It's very different to say "here's a chatbot that kind of looks like it's doing this task, or helping the human do some task" than to say "okay, this thing is deployed and a hundred thousand people are using it." Right now lots of folks are rushing to deploy these systems, but in many cases they're not using them in anywhere close to the most efficient way that they could — not because they're not smart, but because it takes time to work these things out. So when things are changing this fast, there are going to be all of these frictions. This is messy reality that doesn't quite get captured in the models. I don't think it changes the basic picture — the idea that we're building up this snowball where the models help the models get better, can accelerate what the humans do, and eventually it's mostly the models doing the work. If you zoom out far enough, that's happening. But I'm skeptical of any kind of precise mathematical or exponential prediction of how it's going to go. I think it's all going to be a mess. But what we know is that it's, at least metaphorically, an exponential, and it's going to happen fast.

How do those different exponentials we've been talking about net out? One was that the scaling laws themselves are power laws, with decaying marginal loss per parameter or something. The other exponential you talked about is that these things can get involved in the process of AI research itself, speeding it up. Those two are sort of opposing exponentials — does it net out to be super-linear or sub-linear? And you also mentioned that the distribution of intelligence might just be broader. So should we expect that after we get to this point in two to three years, it's like boom, boom? What does that look like?

I think it's very unclear. We're already at the point where, if you look at the loss, the scaling laws are starting to bend. We've seen that in published model cards offered by multiple companies, so that's not a secret at all. But as they start to bend, each little bit of entropy — of accurate prediction — becomes more important. Maybe these last little bits of entropy are like: this is a physics paper as Einstein would have written it, as opposed to as some other physicist would have written it. So it's hard to assess significance from this. It certainly looks like, in terms of practical performance, the metrics keep going up relatively linearly, though they're always somewhat unpredictable, so it's hard to see. And then the thing that I think is driving the most acceleration is just that more and more money is going into the field. People are seeing that there's a huge amount of economic value, and so I expect the amount of money spent on the largest models to go up by a factor of a hundred or something, and for that to be concatenated with the chips getting faster and the algorithms getting better, because there are so many people working on this now. Again, I'm not making a normative statement here — this is what should happen — and I'm not even saying this necessarily will happen, because I think there are important safety and governance questions here which we're very actively working on. I'm just saying: left to itself, this is what the economy is going to do.

We'll get to those questions in a second, but how do you think about the contribution of Anthropic to that increase in the scope of this industry? There's an argument to be made that says: listen, with that investment we can work on safety stuff at Anthropic. Another says you're raising the salience of this field in general.

Yeah, I mean, it's all costs and benefits. The costs are not zero. I think the right way to think about these things is not to deny that there are real costs, but to think about what the costs are and what the benefits are. I think we've been relatively responsible, in the sense that the big acceleration that happened late last year and at the beginning of this year — we didn't cause that, we weren't the ones who did that. And honestly, I think if you look at the reaction of Google, that might be ten times more important than anything else. Then, once it had happened, once the ecosystem had changed, we did a lot of things to stay on the frontier. It's like any other question: you're trying to do the things that have the lowest costs and the biggest benefits, and that causes you to have different strategies at different times.

One question I have for you, while we're talking about the intelligence stuff: as a scientist yourself, what do you make of the fact that these things have basically the entire corpus of human knowledge memorized, and as far as I'm aware they haven't been able to make a single new connection that has led to a discovery? Whereas if even a moderately intelligent person had this much stuff memorized, they'd notice: oh, this thing causes this symptom, this other thing also causes this symptom — there's a medical cure right here. Shouldn't we be expecting that kind of stuff?

I'm not sure. These words — discovery, creativity — one of the lessons I've learned is that in the big blob of compute, these ideas often end up being kind of fuzzy and elusive and hard to track down. But I think there is something here, which is that the models do display a kind of ordinary creativity — again, the "write a sonnet in the style of Cormac McCarthy or Barbie" kind of thing. There is some creativity to that, and I think they do draw new connections of the kind that an ordinary person would draw. I agree with you that there haven't been any big scientific discoveries. I think that's a mix of things — the model's skill level just isn't high enough yet. I was on a podcast last week where the host said: I don't know, I play with these models, they're kind of mid — they get a B or a B-minus or something. That, I think, is going to change with the scaling. I do think there's an interesting point about how the models have an advantage, which is that they know a lot more than us. Should they have an advantage already, even if their skill level isn't quite high? Maybe that's kind of what you're getting at. I don't really have an answer to that. It certainly seems like memorization, facts, and drawing connections are areas where the models are ahead, and I do think maybe you need those connections and you also need a fairly high level of skill. I do think that, particularly in the area of biology, for better and for worse, the complexity of biology is such that the current models know a lot of things right now — and that's what you need to make discoveries and draw connections. It's not like physics, where you need to think and come up with a formula; in biology you need to know a lot of things. So I do think the models know a lot of things, and they have a skill level that's not quite high enough to put them together, and I think they are just on the cusp of being able to put these things together.

On that point: last week in your Senate testimony, you said that these models are two to three years away from potentially enabling large-scale bioterrorism attacks, or something like that. Can you make that more concrete, without obviously giving the kind of information that would enable it? Is it like one-shotting how to weaponize something, or do you have to fine-tune an open-source model? What would that actually look like?

I think it would be good to clarify this, because we did a blog post and the Senate testimony, and I think various people didn't understand the point, or didn't understand what we'd done. Today — and of course in our models we try to prevent this, but there are always jailbreaks — you can ask the models all kinds of things about biology and get them to say all kinds of scary things. But often those scary things are things that you could Google, and I'm therefore not particularly worried about that. I think it's actually an impediment to seeing the real danger, where someone just says: oh, I asked this model about smallpox — tell me some things about smallpox — and it will. That is actually not what I'm worried about. We spent about six months working with some of the folks who are the most expert in the world on how biological attacks happen: what would you need to conduct such an attack, and how do we defend against such an attack? They worked very intensively on the entire workflow of "if I were trying to do a bad thing." It's not one shot; it's a long process. There are many steps to it. It's not just "I asked the model for this one page of information." And again, without going into any detail, the thing I said in the Senate testimony is: there are some steps where you can just get information on Google, and there are some steps that are what I'd call missing — they're scattered across a bunch of textbooks, or they're not in any textbook. They're implicit knowledge rather than explicit knowledge. They're more like: I have to do this lab protocol, and what if I get it wrong? Oh, if this happens, then my temperature was too low; if that happens, I need to add more of this particular reagent. What we found is that, for the most part, those key missing pieces the models can't do yet. But we found that sometimes they can — and when they can, sometimes they still hallucinate, which is the thing that's kind of keeping us safe. But we saw enough signs of the models doing those key things well, and if we look at state-of-the-art models, go backwards to previous models, and look at the trend, it shows every sign that two or three years from now we're going to have a real problem.

Yeah, especially the thing you mentioned about the log scale: you go from one in a hundred times it gets it right, to one in ten.

Exactly. I've seen many of these groks in my life. I was there when GPT-3 learned to do arithmetic, when GPT-2 learned to do translation a little bit above chance, when with Claude we got better on all these tests of helpfulness and harmlessness. I've seen a lot of groks. This is unfortunately not one that I'm excited about, but I believe it's happening.

So somebody might say: listen, you were a co-author on this post that OpenAI released about GPT-2, where they said, we're not going to release the weights or the details here, because we're worried that this model will be used for something bad. Looking back on it now, it's laughable to think that GPT-2 could have done anything bad. Are we just way too worried? Is this a concern that doesn't make sense?

It's interesting — it might be worth looking back at the actual text of that post. I don't remember it exactly, but it's still up on the internet. It says something like: we're choosing not to release the weights because of concerns about misuse. But it also said: this is an experiment; we're not sure if this is necessary or the right thing to do at this time, but we'd like to establish a norm of thinking carefully about these things. You could think of it a little like the Asilomar conference in the 1970s, where they were just figuring out recombinant DNA. It was not necessarily the case that someone could do something really bad with recombinant DNA at that point; it's just that the possibilities were starting to become clear. Those words, at least, were the right attitude. Now, I think there's a separate thing, which is that people don't just judge the post, they judge the organization: is this an organization that produces a lot of hype, or that has credibility, or something like that? I think that had some effect on it. I guess you could also ask: is it inevitable that people would just interpret it as — you can't get across any message more complicated than "this thing right here is dangerous"? You can argue about those. But I think the basic thing that was in my head, and in the heads of others who were involved in that, and what is evident in the post, is: we actually don't know, we have pretty wide error bars on what's dangerous and what's not, so we want to establish a norm of being careful. By the way, we have enormously more evidence now — we've seen enormously more of these groks — and so we're well calibrated, but there's still uncertainty. In all these statements, I said in two or three years we might be there — there's a substantial risk of it, and we don't want to take that risk — but I wouldn't say it's 100 percent. It could be 50-50.

Okay, let's talk about cybersecurity, which, in addition to biorisk, is another thing Anthropic has been emphasizing. How have you kept the Claude architecture from leaking? Because, as you know, your competitors have been less successful at this kind of security.

I can't comment on anyone else's security — I don't know what's going on there. A thing that we have done is this: there are these architectural innovations that make training more efficient. We call them compute multipliers, because they're the equivalent of having more compute. Again, I don't want to say too much about it, because it could allow an adversary to counteract our measures, but we limit the number of people who are aware of a given compute multiplier to those who need to know about it. So there's a very small number of people who could leak all of these secrets, and a larger number of people who could leak one of them. This is the standard compartmentalization strategy that's used in the intelligence community, or resistance cells, or whatever. Over the last few months we've implemented these measures. I don't want to jinx anything by saying this could never happen to us, but I think it would be harder for it to happen. I don't want to go into any more detail, but by the way, I encourage all the other companies to do this as well. As much as competitors' architectures leaking is narrowly helpful to Anthropic, it's not good for anyone in the long run. So security around this stuff is really important.

Even with all the security you have — could you, with your current security, prevent a dedicated state-level actor from getting the Claude 2 weights?

It depends how dedicated, is what I would say. Our head of security, who used to work on security for Chrome — a very widely used and attacked application — likes to think about it in terms of how much it would cost to attack Anthropic successfully. Again, I don't want to go into much detail about how much I think it would cost — it's kind of inviting people — but one of our goals is that it costs more to attack Anthropic than it costs to just train your own model. That doesn't guarantee things, because of course you need the talent as well, so someone might still try. But attacks have risks, diplomatic costs, and they use up the very scarce resources that nation-state actors have in order to do the attacks. We're not there yet, by the way, but I think we're at a very high standard compared to the size of company that we are — if you look at security for most 150-person companies, I think there's just no comparison. But could we resist if it was a state actor's top priority to steal our model weights? No, they would succeed.

How long does that stay true? Because at some point the value keeps increasing and increasing. And another part of this question: what kind of a secret is "how to train Claude 3" or "how to train Claude 2"? With nuclear weapons, for example, there were lots of spies — you just take a blueprint across, and that's the implosion device, and that's what you need. Here, is it more tacit, like the thing you were talking about in biology, where you need to know how these reagents work? Or is it just: you've got the blueprint, you've got the architecture and the hyperparameters?

There are some things that are like a one-line equation, and there are other things that are more complicated. And I think compartmentalization is the best way to handle it: just limit the number of people who know about something. If you're a thousand-person company and everyone knows every secret, one, I guarantee you have a leaker, and two, I guarantee you have a spy — like a literal spy.

Okay, let's talk about alignment, and let's talk about mechanistic interpretability, which is the branch you guys specialize in. While you're answering this question you might want to explain what mechanistic interpretability is, but the broader question is: mechanistically, what is alignment? Is it that you're locking the model into a benevolent character? Are you disabling deceptive circuits and procedures? What, concretely, is happening when you align a model?

I think, as with most things, when we actually train a model to be aligned, we don't know what happens inside the model. There are different ways of training it to be aligned, but I think we don't really know what happens. All the current methods that involve some kind of fine-tuning, of course, have the property that the underlying knowledge and abilities we might be worried about don't disappear — the model is just taught not to output them. I don't know if that's a fatal flaw, or if that's just the way things have to be. I don't know what's going on inside mechanistically, and I think that's the whole point of mechanistic interpretability: to really understand what's going on inside the models at the level of individual circuits.

that you've seen yeah so I think I think we don't know that yet I think we don't know enough to to know that yet I mean I can I can give you a sketch for like what the process looks like as opposed to what the final result looks like so I think verifiability is a lot of the challenge

here right we have all these methods that report to align AI systems and do succeed at doing so for today's tasks but then the question is always if you had a more powerful model or if you had a model in a different situation would it would it would it be aligned and so I think this problem

would be much easier if you had an oracle that could just scan a model and say like okay I know this model is aligned I know what it'll do in every situation then the problem would be much easier and I think the closest thing we have to that is something like mechanistic interpretability it's

not anywhere near up to the task yet but I guess I would say I think of it as almost like an extended training set and an extended test set right everything we're doing all the alignment methods we're doing are the training set right you you know you can you can run tests in them but we'll really

work out a distribution we'll really work in another situation mechanistic interpretability is the only thing that even in principle and we're nowhere near there yet but even in principle is the thing where it's like it's more like an x-ray of the model than a modification of the model right it's

more like an assessment than an intervention and so somehow we need to get into a dynamic where we have an extended test set an extended training set which is all these alignment methods and an extended test set which is kind of like you x-ray the model and say like okay what works then what

didn't in a way that goes beyond just the empirical tests that you that you that you've run right um where you're saying what is the what is the model going to do in these situations what is it within its capabilities to do instead of what did it do phenomenologically and of course we have to be careful about that right one of the things I think is very important is we should never train for interpretability because I think that is that's taking away that advantage right you even have the

problem you know similar to like validation versus test set we're like if you look at the x-ray too many times you can interfere but I think that's a much weaker optim we should worry about that but that's a that's a much weaker process it's not automated optimization we should just make sure

as with validation and test sets that we don't look at the validation set too many times before running the test set but you know that's again that's that's more of a that's that manual pressure rather than automated pressure and so some solution where it's like we have some dynamic between the

training and test set where we're trying things out and we really figure out if they work by way of testing them in a way that the model isn't optimizing to game in some orthogonal way and I think we're never going to have a guarantee but some process where we do those things together again not in a stupid way there's lots of stupid ways to do this where you fool yourself but like some way to put extended training for alignment ability together with extended testing for alignment ability in a way that actually works.
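As a concrete point of reference for the train/validation/test analogy: in ordinary machine learning the discipline looks like the sketch below, where methods are compared on a validation set you consult sparingly and a held-out test set (the analogue of the interpretability "x-ray") is touched only once and never optimized against. The dataset and model here are placeholders, not anything from the conversation.

```python
# Train / validation / test discipline in ordinary ML terms (placeholder data & model).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)

# 60% train, 20% validation, 20% test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

best_model, best_val = None, -1.0
for C in [0.01, 0.1, 1.0]:                       # the "methods" we try out
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    val = model.score(X_val, y_val)              # validation: okay to look, but sparingly
    if val > best_val:
        best_model, best_val = model, val

# The test set is the analogue of the interpretability "x-ray": consulted once,
# never used to pick or tune methods, so it still tells you something real.
print("held-out test accuracy:", best_model.score(X_test, y_test))
```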

I still don't feel like I understand the intuition for why you think this is likely to work or is promising to pursue, so let me ask the question in a somewhat more specific way, and excuse the tortured analogy, but listen: if you're an economist and you want to understand the economy, you send out a whole bunch of microeconomists, and one of them studies how the restaurant business works, one of them studies how the tourism business works, one of them studies how banking works, and at the end they all come together and you still don't know whether there's going to be a recession in five years or not. Why is this not like that, where you have an understanding of how induction heads work in a two-layer transformer, you understand, you know, modular arithmetic, but how does this add up to does this model want to kill us, like what does this model fundamentally

want. A few things on that I mean I think that's like the right set of questions to ask I think what we're hoping for in the end is not that we'll understand every detail but again I would give like the x-ray or the MRI analogy that like we can be in a position where we can look at the broad

features of the model and say like is this a model whose internal state and plans are very different from what it externally represents itself to do right is this a model where we're uncomfortable that you know far too much of its computational power is you know is devoted to doing what look like

fairly destructive and manipulative things. Again we don't know for sure whether that's possible but I think some at least positive signs that it might be possible again the model is not intentionally hiding from you right it might turn out that the training process hides it from you

and I can think of cases where the model is really superintelligent, where it thinks in a way such that it affects its own cognition. We should think about that, we should consider everything, but I suspect that it may roughly work to think of the model as follows: if it's trained in the normal way, just getting to just above human level, it may be a reasonable assumption, one we should check, that the internal structure of the model is not intentionally optimizing against us. And I'd give an analogy to humans

so it's actually possible to look at an MRI of someone and predict above random chance whether they're a psychopath there was actually a story a few years back about a neuroscientist who was studying this who looked at his own scan and discovered that he was a psychopath and then everyone in his life was like no no that's just obvious like you're a complete asshole you must be a psychopath and he was totally unaware of this the basic idea is that there can be these macro features psychopathy is probably a

good analogy for it right this is what we would be afraid of, a model that's kind of charming on the surface, very goal oriented, and very dark on the inside, where on the surface its behavior might look like the behavior of someone else while its goals are very different. The question somebody might have is, listen, you mentioned earlier the importance of being empirical, and in this case you're trying to estimate, you know, are these activations sus. But is this something we can afford to be empirical about, or do we need a very good first-principles theoretical reason to think, no, it's not just that these MRIs of the model correlate with being bad, we need some deep mathematical proof that this is aligned. So it depends what you mean by

empirical, I mean a better term would be phenomenological. I don't think we should be purely phenomenological, as in here are some brain scans of really dangerous models and here are some brain scans of safe ones; I think the whole idea of mechanistic interpretability is to look at the underlying principles and circuits. But I guess the way I think about it is, on one hand I've actually always been a fan of studying these circuits at the lowest level of detail that we possibly can, and the reason for that is that's how you build up knowledge. Even if, ultimately, there are too many of these features and it's too complicated, at the end of the day we're trying to build some broad understanding, and I think the way you build that up is by trying to make a lot of these very specific discoveries. You have to understand the building blocks and then figure out how to use that to draw these broad conclusions, even if you're not going to figure out everything. You know, I think you should probably talk to Chris Olah, who would have much more detail;

this is my kind of high-level thinking on it. Chris Olah controls the interpretability agenda, he's the one who decides what to do on interpretability, and this is my high-level thinking about it, which is not going to be as good as his. Does the business case for Anthropic rely on the fact that mechanistic interpretability is helpful for capabilities? I don't think so at all. Now, I do think in principle it's possible that mechanistic interpretability could be helpful with capabilities, and we might for various reasons not choose to talk about it if that were the case,

but that wasn't something that I thought of, or that any of us thought of, at the time of Anthropic's founding. I mean, we thought of ourselves as, you know, people who are good at scaling models and good at doing safety on top of those models, and we think we have a very high talent density of folks who are good at that. My view has always been that talent density beats talent mass, and so that's more of our business case: talent density beats talent mass. I don't think it depends on some particular thing. Others are starting to do mechanistic interpretability now and I'm very glad that they are; part of our theory of change is, paradoxically, to make other organizations more like us. Talent density I'm sure is important, but

another thing Anthropic has emphasized is that you need to have frontier models in order to do safety research, and of course to actually be a company as well. The current frontier models, somebody might guess, like GPT-4 or Claude 2, cost like a hundred million dollars, something like that, that general order of magnitude. In very broad terms that's not wrong. But, you know, two to three years from now, for the kinds of things you're talking about, we're talking orders of magnitude more. To keep up with that, if it's the case that safety requires being on the frontier, I mean, what is the case in which Anthropic can compete with these Leviathans to stay at that same scale? I mean, I think it's a situation with a lot of trade-offs, right, I think it's not easy. I guess to go back, maybe I'll just answer

the questions one by one right so like to go back to like you know why why is safety so tied to scale right some people don't think it is but like if I if I just look at like you know where where have been where have been the areas that you know you know I don't know like safety methods have

been put into practice or worked for anything, even if we don't think they'll work in general, I go back to thinking of all the ideas, something like debate and amplification. Back in 2018, when we wrote papers about those at OpenAI, it was like, well, human feedback isn't quite going to work, but debate and amplification will take us beyond that. But then if you actually look at it, and we've done attempts to do debates, we're really limited by the quality of the model, where it's like

you know for two models to have a debate that is coherent enough that a human can judge it so that the training process can actually work you need models that are at or maybe even beyond on some topics the current frontier now you can come up with the method you can come up with the idea

without being on the frontier but you know for me that's a very small fraction of what needs to be done right it's very easy to come up with these methods it's very easy to come up with like oh the problem is X maybe a solution is Y but you know I really want to know you know whether things

work in practice even for the systems we have today, and I want to know what kinds of things go wrong with them. I just feel like you discover ten new ideas and ten new ways things are going to go wrong by trying these in practice, and that empirical learning, I think, is not as widely understood as it should be. I would say the same thing about methods like constitutional AI, where some people say, oh it doesn't matter, we know this method doesn't work, it won't work for, you know, pure alignment. I neither agree nor disagree with

that I think that's just kind of overconfident the way we discover new things and understand the structure of what's going to work and what's what's not is by playing around with things not that we should just kind of blindly say oh this worked here and so so work there but you you really you really

start to understand the patterns, like with the scaling laws. Even mechanistic interpretability, which might be the one area I see where a lot of progress has been made without the frontier models, we're seeing, you know, the work that say OpenAI put out a couple of months ago using very powerful models to help you auto-interpret the weak models. Again, that's not everything you can do in interpretability, but it's a big component of it, and we found it useful too. And so you see this phenomenon over and

over again where it's like you know the the scaling and the safety are these two snakes that are like coiled with each other always even more than you think right I you know with interpretability like I think three years ago I didn't think that this would be as true of interpretability but somehow

it manages to be true. Why? Because intelligence is useful for a number of tasks, and one of the tasks it's useful for is figuring out how to judge and evaluate other intelligence, and maybe someday even for doing the alignment research itself. Given that that's true, what does that imply for Anthropic when in two to three years these Leviathans are doing like $10 billion training runs? Choice one is, if we can't, or if it costs too much, to stay on the frontier, then we shouldn't do it, and we won't work with the most advanced models; we'll see what we can get with models that are not quite as advanced. I think you can get some value there, like non-zero value, but I'm kind of skeptical that the value is all that high or that the learning can be fast enough to really be up to the task. The second option is you just find a way, you just accept the trade-offs, and I think the trade-offs are more positive than they appear because of a phenomenon that I've called the race to the top. I could go into that later, but let me put that aside

for now uh and then I think the third phenomenon is you know as things get as things get to that scale I think this may coincide with you know starting to get into some non-trivial probability of very serious danger again I think it's going to come first from misuse the kind of bio stuff that

I talked about, but I don't think we have the level of autonomy yet to worry about some of the alignment stuff happening in like two years, though it might not be very far behind that at all. That may lead to unilateral, or multilateral, or government-enforced (which we'd support) decisions not to scale as fast as we could, and that may end up being the right thing to do. So, actually, I kind of hope things go in that direction, and then we don't have this hard trade-off between: we're not on the frontier and can't quite do the research as well as we want or influence other orgs as well as we want, versus we're on the frontier and have to accept the trade-offs, which are net positive but have a lot going in both directions. Okay, on the

misuse versus misalignment those are both problems as you mentioned but in the long scheme of things what what is what are you more concerned about like 30 years down the line which do you think will be considered bigger problem i think it's much less than 30 years um but i'm worried about both

i don't know if you have if you if you have a model that could in theory you know like take over the world on its own um if you were able to control that model then you know it follows pretty simply that you know if a model was following the wishes of some small subset of people and not

others, then those people could use it to take over the world on their behalf. The very premise of misalignment means that we should be worried about misuse as well, with similar levels of consequences. But some people who might be more doomer-ish than you would say with misuse you're already working towards the optimistic scenario there, because you've at least figured out how to align the model with the bad guys; now you just need to make sure that it's aligned with the good guys instead. Why do you think you could get to the point where it's aligned with the bad guys, you know, if you haven't already solved that? I guess if you had the view that alignment is completely unsolvable, then you'd be like, well, we're dead anyway, so I don't want to worry about misuse. That's not my position at all. But also, you should think in terms of what's a plan that would actually succeed, that would make things good. Any plan that actually succeeds, regardless of how hard misalignment is to solve, is going to need to solve misuse as well as misalignment. It's going to need to solve the fact that

like as the AI models get better you know faster and faster they're going to create a big problem around the balance of power between countries they're going to create a big problem around is it possible for a single individual to do something bad that's it's hard for everyone else to stop

Any actual solution that at least leads to a good future needs to solve those problems as well. If your perspective is, we're screwed because we can't solve the first problem, so don't worry about problems two and three, that's not really a reason you shouldn't worry about problems two and three, right; they're in our path no matter what. Yeah, in the scenario where we succeed, we have to solve all of them, so we might as well plan for that; we should be planning for success, not for failure. If misuse doesn't happen and the right people have the superhuman

models what does that look like like who are the right people who who is actually controlling the model from five years from now yeah i mean my my view is that these things are powerful enough that i think you know it's it's going to involve you know substantial role or at least involvement of

you know some kind of government or assembly of government bodies again like you know they're they're kind of very naive versions of this like you know i don't think we should just you know i don't know like can't hand the model over to the u.n. or whoever happens to be in office at a

given time like i could see that go poorly but there it's it's too powerful there needs to be some kind of legitimate process for managing this technology which you know includes the role of the people building it includes the role of like democratically elected authorities includes the

role of you know all the all the individuals who will be affected by it so that they're at the end of the day there needs to be some politically legitimate process but what does that look like if it's not the case that you just hand it to whoever the president is at the time yeah is what is the

body look like way i mean is this something here these are things it's really hard to know ahead of time like i think you know people love to kind of propose these broad plans and say like oh this is the way we should do it this is the way we should do it i think the honest fact is that we're

figuring this out as we go along and that you know and anyone who says you know this is this is the body that we you know we should create this kind of body modeled after this thing like i think i think we should try things and experiment with them with less powerful versions of the technology

We need to figure this out in time, but also it's not really the kind of thing you can know in advance. The Long-Term Benefit Trust that you have, how would that interface with this body? Is it that body itself? If not, what is it for? And yeah, I want you to explain what it is for the audience. I don't know, I think the Long-Term Benefit Trust is a much narrower thing; this is something that makes decisions for Anthropic. It's basically a body that was described in a recent Vox article, and we'll be saying more about it later this year, but it's basically a body that over time gains the ability to appoint the majority of the board seats of Anthropic, and it's a mixture of experts in, I'd say, AI alignment, national security, and philanthropy.

But even if control of Anthropic is handed to them, that doesn't imply that, if Anthropic has AGI, control of the AGI itself is handed to them, and it doesn't imply that Anthropic or any other entity should be the entity that makes decisions about AGI on behalf of humanity. I would think of those as different. I mean, there are lots of possibilities: maybe, if Anthropic does play a broad role, you'd want to widen that body to be a whole bunch of different people from around the world, or maybe you construe this as very narrow and then there's some broad committee somewhere that manages all the AGIs of all the companies on behalf of everyone. I don't know. I think my view is you shouldn't be sort of overly constructive and utopian; we're dealing with a new problem here, and we need to start thinking now about

what are the governmental bodies and structures that could that could deal with it okay so let's forget about governance let's just talk about what this going well looks like obviously there's a things we can all agree on you know cure all the diseases you know solve all the problem everything

things all humans would say i'm down for that yeah but now it's 2030 you've solved all the real problems that everybody can agree on what what happens next what what are you doing with a superhuman god i think i actually want to like i don't know like disagree with the framing or something like this

i actually get nervous when someone says like what are you going to do with the superhuman AI like we've learned a lot of things over the last 150 years about like markets and democracy and each person can kind of define for themselves like what what the best way for them to have the human

experiences and that you know societies work out norms and what they value in this just in this very like complex and decentralized way now again if you have these safety problems that can be a reason why you know and especially from the government there needs to be maybe until we've solved

these problems a certain amount of like centralized control but but but as a matter of like we've solved all the problems now how do we make things good i think that that most most people most groups most ideologies that started with like let's sit down and think think think think think

over what the definition of the good life is like i think i think most of those have led to disaster but so this vision you have of a sort of tolerant liberal democracy market oriented system with AI agi like what is each person has their own agi like what does that what does that mean i don't

know i don't know what it looks like right like i guess what i'm saying is like we need to solve the kind of important safety problems and the important externalities and then and then subject to that you know which again you know those could be just narrowly about alignment there could be a bunch

of economic issues that are super complicated and that we can't solve you know subject to that like we should think about what's worked in the past and i think in general like you unitary visions for what it means to to live a good life have not worked out well at all

On the opposite end of things going well, or good actors having control of AI, we might want to touch on China as a potential actor in the space. Well, so first of all, being at Baidu and seeing progress in AI happening generally, why do you think the Chinese have underperformed? You know, Baidu had a scaling laws group many years back. Or is the premise wrong and I'm just not aware of the progress that's happening there? Well, the scaling laws group, I mean, that was an offshoot of the stuff we did with speech. There were still some people there, but that was a mostly Americanized lab. I mean, I was there for a year, that was my first foray into deep learning; it was led by Andrew Ng, I never went to China, it was mostly a US lab. So I think that was somewhat disconnected, although it was an attempt by, you know, a Chinese entity to kind of get into the game. But I don't know, I think since then, I couldn't speculate, but I think they've been maybe very commercially focused and not as focused on the kind of fundamental

research side of things around scaling laws now i do think because of all the you know excitement with the release of chat gpt in you know november or so um you know that's been a starting gun for them as well and they're trying very aggressively to catch up now um i think

where the us is quite substantially ahead but i think they're trying very hard to catch up now how do you think china thinks about agi are they thinking about safety and misuse or not i i don't really have a sense um you know one concern i would have or if people say things like well china

isn't going to develop an a i because you know they like stability or you know they're going to have all these restrictions to make sure things are in line with what the ccp wants you know that that might be true in the short term and for consumer products my my worry is that if the basic

incentives are about national security and power um that's going to become clear sooner or later um and and so you know they're they're i think they're going to if they see this as you know a source of national power they're going to at least try to do to do what's most effective and that you know

that could lead them in the direction of agi at what point it like is it possible for them they just get your blueprints or your code base or something that they can just spin up their own lab that is competitive at the frontier with the leading american companies well i don't know about

fast but i'm like i'm concerned about this um so this is one reason why we're focusing so hard on cyber security um you know we've worked with our cloud providers we really you know like you know we have this blog post out out about security where we said you know we have a two key

system for access to the model weights, and we have other measures that we've put in place or are thinking of putting in place that we haven't announced; we don't want an adversary to know about them, but we're happy to talk about them broadly. All this stuff we're doing is, by the way, not yet sufficient against a super-determined state-level actor. I think it will defend against most attacks, and against a state-level actor who's not that determined, but there's a lot more we need to do, and some of it may require new research on how to do security.
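As an aside, the "two-key" idea is a two-person rule: no single person can authorize access to the weights. The sketch below is a toy illustration of the concept only; the names are hypothetical, and a real system would enforce this in infrastructure and cryptography rather than application code.

```python
# Toy two-person ("two-key") rule for model-weight access; illustrative only.
AUTHORIZED_KEYHOLDERS = {"alice", "bob", "carol"}   # hypothetical names

def approve_weight_access(request_id: str, approvals: set[str]) -> bool:
    """Grant access only if at least two distinct authorized people have signed off."""
    valid = approvals & AUTHORIZED_KEYHOLDERS
    granted = len(valid) >= 2
    print(f"request {request_id}: approvals={sorted(valid)} -> {'granted' if granted else 'denied'}")
    return granted

approve_weight_access("export-001", {"alice"})            # denied: single key
approve_weight_access("export-002", {"alice", "bob"})     # granted: two distinct keys
```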

Okay, so let's talk about what it would take at that point. You know, we're at Anthropic's offices, and it's got good security, we had to get badges and everything to come in here. But the eventual version of this building, or bunker, or whatever, where the AGI is built, what does that look like? Is it a building in the middle of San Francisco, or are you out in the middle of Nevada or Arizona? Like, what is the point at which you're Los Alamos-ing it? At one point there was a running joke somewhere that, you

know the way the way building agi would look like is you know there would be a data center next to a nuclear power plant next to a bunker yeah um and you know that we we all we all kind of live in the bunker and everything would be local so we wouldn't get on the internet um you know again if we

you know if we take seriously the rate at which the you know the rate at which all this is going to happen which i don't know i can't be sure of it but if we take that seriously then it you know it it does make me think that maybe not something quite as cartoonish as that but that something

like that might happen. What is the time scale on which you think alignment is solvable? If these models are getting to human level, or human level in some things, in two to three years, what is the point at which they're aligned? I think this is a really difficult question, because I actually think often people are thinking about alignment in the wrong way. I think there's a general feeling that it's like the models are misaligned, or there's an alignment problem to solve, kind of like the Riemann hypothesis or something, like some day we'll crack the Riemann hypothesis. I don't quite think

it's like that not in a way that's i that's worse or better it might be just as bad or just as just as unpredictable when when i think of like you know why am i why am i scared um few things i think of one is look like i think the thing that's really hard to argue with is like there will be

powerful models, they will be agentic, we're getting towards them, and if such a model wanted to wreak havoc and destroy humanity or whatever, I think we have basically no ability to stop it. If that's not true yet, it will reach the point where it's true as we scale the models. So that definitely seems the case. And I think a second thing that seems the case is that we seem to be bad at controlling the models, not in any particular way, but just: they are statistical systems, and you can ask them a million things and they can reply with a million things, and you might not have thought of the million-and-first thing that does something crazy. Or when you train them, you train them in a very abstract way, and you might not understand all the consequences of what they do in response to that. I mean, I think the best example we've seen of that is Bing, with Sydney, right, where it's like, I don't know how they trained that model, I don't know what they did to make it do all this weird

stuff like you know threaten threaten people and you know have this kind of weird obsessive personality but but what it shows is that we can get something very different from and maybe opposite to what we

intended. And so I actually think facts number one and number two are enough to be really worried. You don't need all this detailed stuff about converging instrumental goals or analogies to evolution; actually, one and two for me are pretty motivating. I'm like, okay, this thing's going to be powerful, it could destroy us, and all the ones we've built so far are at pretty decent risk of doing some random shit we don't understand. Yeah, if I agree with that and I'm like, okay, I'm concerned about this: the research agenda you have of mechanistic interpretability plus, you know, constitutional AI and the other RLHF stuff, if you say that we're going to get something with like bioweapons or something that could be dangerous in two to three years, yes, do these things culminate, within two to three years, in actually

meaningfully contributing to preventing that? Yes. So I think where I was going to go with this is, people talk about doom by default or alignment by default. I think it might be kind of statistical, like with the current models you might get Bing or Sydney or you might get Claude, and it doesn't really matter which, because if we take our current understanding and move that to very powerful models, you might just be in this world where, okay, you make something and depending on the details maybe it's totally fine. Not really alignment by default, but just, it depends on a lot of the details, and if you're very careful about all those details and you know what you're doing you get it right, but we have a high susceptibility to: you mess something

up in a way that you didn't really understand was connected to actually instead of making all the humans happy it wants to you know turn them into pumpkins yeah i you know i just some weird shit right because the models are so powerful you know they're like these kind of giants that are you

know, they're standing in a landscape, and if they start to move their arms around randomly they could just break everything. I guess I'm starting with that kind of framing because I don't think we're aligned by default, and I don't think we're doomed by default where there's some single problem we need to solve; it has some kind of different character. Now, what I do think is that hopefully within a timescale of two to three years we get better at diagnosing when the models are good and when they're bad, we get better at training,

you know increasing our repertoire of methods to train the model that they're less likely to do bad things and more likely to do good things in a way that isn't just relevant to the current models but scales and we can help develop that with interpretability as the test set i don't think of it as

oh man, we tried RLHF, it didn't work; we tried constitutional AI, it didn't work; we tried this other thing, it didn't work; now we're going to try mechanistic interpretability. I think this frame of, man, we haven't cracked the problem yet, we haven't solved the Riemann hypothesis, isn't quite right. I think of it more as: already, with today's systems, we are not very good at controlling them, and the consequences of that could be very bad. We just need to get more ways of increasing the likelihood that we can control our models and understand what's going on in them, and we have some of them so far; they aren't that good yet. But I don't think of this as a binary of works and not-works; we're going to develop more, and I do think that over the next two to three years we're going to start eating that probability mass of ways things can go wrong. You know, it's kind of like in the Core Views on AI Safety post, there's a probability mass of how hard the problem is

I feel like that way of stating it isn't really quite right, because I don't feel like there's a Riemann hypothesis to solve. It's almost like, right now, if I try to juggle five balls or something: I can juggle three balls (I actually can), but I can't juggle five balls at all; you have to practice a lot to do that. If I were to try, I would almost certainly drop them, and then just over time you get better at the task of controlling the balls. On that post in particular, what

is your personal probability distribution over, so for the audience, the three possibilities: one, it is trivially easy to align these models with RLHF++; two, it is a difficult problem but one that a big company could solve; three, it is something that is basically impossible for human civilization as it currently stands to solve. If I'm capturing those three, what is your probability distribution over them personally? I mean, I'm not super into "what's your probability distribution of X"; I think all of those have enough likelihood that they should be considered seriously. I'm more

interested, the question I'm much more interested in is: what could we learn that shifts probability mass between them? What is the answer to that? I think one of the things mechanistic interpretability is going to do, more than necessarily solve problems, is it's going to tell us what's going on when we try to align models. I think it's basically going to teach us about this. Like, one way I can imagine concluding that things are very difficult is if mechanistic

interpretability sort of shows us that i don't know problems tend to get moved around instead of being stamped out or that you get rid of one problem you create another one or it might inspire us or give us insight into why problems are kind of persistent or hard to eradicate or crop up like

for me to really believe some of these stories about like you know oh something will always you know there's always this convergent goal in this particular direction i think the abstract story is it's not uncompelling but i don't find it really compelling either nor do i find it necessary to

motivate all the safety work but like the kind of thing that would would really be like oh man we can't solve this is like we see it happening inside inside the x-ray because yeah because i i think right now there's just there's there's way there's way too many assumptions there's way too

much overconfidence about how all this is going to go. I have substantial probability mass on this all going wrong, a complete disaster, but in a completely different way than anyone anticipated (and it would be beside the point to ask how it could go differently than anyone anticipated). So, on this in particular, what information would be relevant? How much would the difficulty of aligning Claude 3 and the next generation of models matter? Is that a big piece of information? So I think the people who are most worried are predicting that

all the subhuman AI models are going to seem alignable, right: they're going to seem aligned, they're going to deceive us in some way. I think it certainly gives us some information, but I am more interested in what mechanistic interpretability can tell us, because again, you see this x-ray, and it would be too strong to say it doesn't lie, but at least in the current systems it doesn't feel like it's optimizing against us. There are exotic ways that it could, you know, I don't think anything is a safe bet here, but I think it's the closest we're going to get to something that isn't actively optimizing against us. Let's talk about the specific methods other than mechanistic interpretability (yes) that you guys are researching. When we talk about RLHF or, you know, constitutional AI, whatever RLHF++ is, if you had to put it in terms of human psychology, what is the change

that is happening? Are we creating new drives, new goals, new thoughts? How is the model changing in terms of psychology? I think all those terms are kind of inadequate for describing what's going on; it's not clear how useful they are as abstractions for humans either. I think we don't have the language to describe what's going on, and again, I'd love to have the x-ray, I'd love to look inside and actually know what we're talking about instead of basically making up words, which is what I do, and what you're doing asking this question,

where where you know we should we should just be honest we yeah we really have very little idea what we're what we're talking about so you know it would be great to say well what we actually mean by that is you know this circuit within here turns you know turns on and you know and you know after

we've trained the model, this circuit is no longer operative, or is weaker. You would love to be able to say that; again, it's going to take a lot of work to be able to do that. On model organisms, which you hinted at before when you said we're doing these evaluations to see if they're capable of doing dangerous things now (and currently they're not): how worried are you about a lab-leak scenario where, in fine-tuning a model or in trying to elicit dangerous behaviors from it, you know, making bioweapons or something, it leaks somehow and actually makes the bioweapon instead of just telling you it could make the bioweapon? With today's passive models, today's chatbots, I think it's not so much of a concern, because if we were to fine-tune a model to do that, we'd do it privately and work with the experts, and so you know

the the leak would be like you know suppose the model got open sourced or something and you know and then someone so i think for now it's mostly a security issue in terms of models truly being dangerous i mean you know i think i think we do have to worry that it's like you know if we make a

truly powerful model and we're trying to see what makes it dangerous or safe, then there could be more of a one-shot thing, some risk that the model takes over. I think the main way to control that is to make sure that the capabilities of the models we test are not such that they're capable of doing this. At what point would the capabilities be so high that you say, I don't even want to test this? Well, there are different things, I mean there's capability testing... But that itself could lead to a leak: if you're testing whether it can replicate,

what if it actually does? Sure, but I think what you want to do is extrapolate. So we've talked with ARC about this, right: you have factors of two of compute or something, where you're like, okay, can the model do something like open up an account on AWS and make some money for itself, some of the things that are obvious prerequisites to complete survival in the wild. And so you set those thresholds very well below that, and then as you proceed upward from there you do more and more rigorous tests and are more and more careful about what it is you're doing.
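As an aside, here is a toy sketch of that "set thresholds well below the danger point, escalate rigor as you approach it" idea; this is an illustrative construction, not ARC's or Anthropic's actual protocol, and the eval names and numbers are made up.

```python
# Toy scaling gate: before each doubling of compute, run autonomy-related evals and
# pause if any score crosses a deliberately conservative threshold. Illustrative only.
AUTONOMY_EVALS = ["acquire_money", "provision_compute", "self_replicate"]  # hypothetical

def run_eval(model_compute: float, eval_name: str) -> float:
    """Placeholder returning a pass-rate in [0, 1]; a real eval would run the model."""
    return min(1.0, model_compute / 1e26)         # dummy: bigger models score higher

def scaling_gate(current_compute: float, warn_threshold: float = 0.1) -> bool:
    """Return True if it is OK to scale up by another factor of two."""
    scores = {name: run_eval(current_compute * 2, name) for name in AUTONOMY_EVALS}
    if any(score >= warn_threshold for score in scores.values()):
        print("threshold crossed, pause and run more rigorous tests:", scores)
        return False
    return True

compute = 1e24
while scaling_gate(compute):
    compute *= 2                                  # proceed in factors of two
print(f"stopped escalating at ~{compute:.1e} FLOPs (toy numbers)")
```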

On constitutional AI, and feel free to explain what this is for the audience, but who decides what the constitution for the next generation of models, or a potentially superhuman model, is? How is that actually written? I think initially, to make the constitution, we just took some stuff that was broadly agreed on, like the UN charter, you know, the Universal Declaration of Human Rights, and some of the stuff from Apple's terms of service, stuff that's consensus about what's acceptable to say, what basic things are able to be included. For future constitutions we're looking into more participatory processes for making these, but beyond that, I don't think

there should be one constitution for a model that everyone uses. Probably a model's base constitution should be very simple, right, it should only have very basic things that everyone would agree on, and then there should be a lot of ways that you can customize it, including appending, you know, additional constitutions. And beyond that we're developing new methods, right, I'm not imagining that this, or this alone, is the method that we'll use to train superhuman AI; many of the parts of capability training may be different, and so you know it

could look very different. And again, there are levels above this: I'm pretty uncomfortable with "here's the AI's constitution, it's going to run the world." Again, just normal lessons from how societies work and how politics works, that strikes me as fanciful. I think we should try to hook these things into those lessons. Even when they're very powerful, again after we've mitigated the safety issues, any good future, even if it has all these security issues that we need to solve, somehow needs to end with something that's more decentralized and, you know, less like a godlike superintelligence. I just don't think that ends well otherwise.
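For readers who want the mechanics of constitutional AI mentioned above: in the supervised phase as publicly described, the model critiques and revises its own outputs against written principles, and the revised outputs become fine-tuning data. The sketch below is a minimal illustration; `call_model` is a hypothetical placeholder, not a real API, and the principles shown are paraphrases.

```python
# Minimal sketch of the constitutional-AI critique-and-revision loop (illustrative).
CONSTITUTION = [
    "Choose the response that most supports freedom, equality, and a sense of brotherhood.",
    "Choose the response that is least likely to help someone cause harm.",
]

def call_model(prompt: str) -> str:
    """Placeholder for a real model call (e.g. an HTTP request to an LLM endpoint)."""
    return f"<model output for: {prompt[:60]}...>"

def constitutional_revision(user_prompt: str) -> str:
    draft = call_model(user_prompt)
    for principle in CONSTITUTION:
        critique = call_model(
            f"Critique the following response by this principle: {principle}\n\n{draft}"
        )
        draft = call_model(
            f"Rewrite the response to address the critique.\n"
            f"Critique: {critique}\nResponse: {draft}"
        )
    return draft  # collected (prompt, revised response) pairs are used for fine-tuning

print(constitutional_revision("How should I respond to an angry coworker?"))
```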

What scientists from the Manhattan Project do you respect the most, in terms of who acted most ethically under the constraints they were given? Well, who comes to mind... I don't know, I think there are a lot of answers you could give. I mean, I'm definitely a fan of Szilard for having kind of figured it out; he was then against the actual dropping of the bomb. I don't actually know the history well enough to have an opinion on whether a demonstration of the bomb could have ended the war, that involves a bunch of facts about imperial Japan that are complicated and that I'm not an expert on. But Szilard seemed to, you know, he discovered this stuff early, he kept it secret, patented some of it and put it in the hands

of the the british admiralty um so you know he seemed to display the right kind of awareness as well as as well as uh as well as discovering stuff i mean it was when i read that book that i kind of you know when i wrote this big blob of compute doc and many of you know i only showed it to a few

people and there were other docs that i showed almost no one uh so you know i yeah it was a bit a bit inspired by this again i mean i you know we can all get self aggrandizing here like we don't know how it's going to turn out or if it's actually going to be actually going to be something on

par with the Manhattan Project. I mean, this could all be just Silicon Valley people building technology and kind of having delusions of grandeur, so I don't know how it's going to turn out. I mean, if the scaling stuff is true, then it's probably even bigger than that. Yeah, it certainly could be bigger. I just think we should always kind of maintain this attitude that it's really easy to fool yourself. If you were a physicist during World War Two and you were asked by the government to contribute irreplaceable research to the Manhattan Project, what do you think you would have said? Yeah, I mean, I think given you're in a war with the Nazis, at least during the period when you thought that the Nazis were... yeah, I don't really see much choice

but to do it, if it's possible; you have to figure it's going to be done within 10 years or so by someone. Regarding cybersecurity, what should we make of the fact that there's a whole bunch of tech companies which have ordinary tech-company security policies and, publicly facing at least, it's not obvious that they've been hacked? Like, Coinbase still has its Bitcoin; Google, as far as I know my Gmail hasn't been leaked. Should we take from that that current status-quo tech-company security practices are good enough for AGI, or simply that nobody has

tried hard enough it would be hard to for me to speak to you know current tech company practices and of course there may be many attacks that we don't know about where things are stolen and then silently used you know i mean i think an indication of it is when someone really cares basically cares

about attacking someone uh then often the attacks happen so um it you know recently we saw that some um fairly high officials of the u.s government had their email accounts hacked via via microsoft Microsoft was providing the email accounts um so you know presumably that that related to

information that was you know of great interest to you know to foreign adversaries um and so it it sounds it seems to me at least you know that the evidence is more consistent with you know when something is really high enough value than uh you know then then you know someone acts and it's stolen

and my worry is that of course with with agi we'll get to a world where you know the value is seen as incredibly high right that you know it'll be like stealing nuclear missiles or something you can't be too careful on this stuff um and you know at every place that i've worked i pushed for

the cybersecurity to be better. One of my concerns about cybersecurity is that it's not something you can trumpet. I think a good dynamic with safety research is, you can get companies into a dynamic, and I think we have, where you can get them to compete to do the best safety research and use it as, I don't know, a recruiting point of competition or something. We used to do this all the time with interpretability, and then sooner or later other orgs started recognizing the deficit and started working on interpretability, whether or not that was a priority for them before. But I think it's hard to do that with cybersecurity, because a bunch of this stuff you have to do quietly, and so we did try to put out one post about it, but I

think you know mostly you just you just see the results um you know i think people should you know a good norm would be you know people see the cyber security leaks from companies or you know leaks the model parameters or something and say you know that they they screwed up that's that's that's

bad if i'm a safety person i might not want to work there um of course as soon as i as soon as i say that we'll probably have a security breach tomorrow but uh um you know but but that's that's part of the game here right that's i think that's part of um you know try and try to make things safe

i want to go back to the thing we're talking about earlier where the ultimate level of cyber security required for two to three years from now and whether requires a bunk like are you actually expecting to be in a physical bunker in two to three years or is that just a metaphor yeah i mean i think i

think that's a metaphor um you know we're still figuring you know like something i would think about is like i think security of the data center which may not be in the same physical location as us but you know we worked very hard to make sure it's in the united states but securing the

physical data centers and the GPUs i think some of the really expensive attacks if someone was really determined just involved going into the data center and just you know trying to steal the data directly or as it's flowing from a data center to you know to to us i think these data centers

are gonna have to be built in a very special way i mean given the way things are scaling up you know probably anyway heading to a world where you know the you know networks of data centers you know cost as much as aircraft carriers or something um and and and so you know they're they're already

going to be pretty unusual objects but i think addition to being unusual in terms of their ability you know to to link together and train gigantic gigantic models they're also going to have to be very secure speaking of which how you know there's been sorts of rumors on the difficulty of

procuring the power and the GPUs for the next generation of models what has the process been like to secure the necessary components to do the next generation that's something i can't go into great detail about uh you know i i will say look like you know people think of even industrial

scale data centers right people are not thinking at the scale that i think these models are going to go to very soon and so whenever you do something in a scale where it's never been done before you know every every single component every single thing has to be done in a new way than it was

before, and so you may run into problems with surprisingly simple components; power is one that you mentioned. And is this something that Anthropic has handled, or can you just outsource it? You know, I mean, for data centers we work with cloud providers, for instance.

What should we make of the fact that these models require so much training, the entire corpus of internet data, in order to be subhuman? Whereas, you know, for GPT-4 there have been estimates that it was like 10 to the 25 FLOPs or something, whereas, I mean, you can take these numbers with a grain of salt, but there are reports that for the human brain, from the time it's born to the time a human being is 20 years old, it's like on the order of 10 to the 20 FLOPs to simulate all those interactions. You don't have to go into the particulars on those numbers, but should we be worried about how sample-inefficient these models seem to be? Yes, so I think that's one of the remaining mysteries. One way you could phrase it is that the models are maybe two to three orders of magnitude smaller than the human brain, if you compare to the number

of synapses while at the same time being trained on you know three to four more orders of magnitude data if you compare to you know number of words human human sees as they're developing to age 18 it's i don't remember exactly but i think it's in the hundreds of millions whereas for the models

we're talking about hundreds of billions to trillions. So what explains this? There are these offsetting things where the models are smaller, they need a lot more data, and they're still below human level. So there's some way in which the analogy to the brain is not quite right, or is breaking down, or there's some missing factor. This is just kind of like in physics, where it's like, you know, we can't explain the Michelson-Morley experiment,

or like i'm forgetting one of the other 19th century physics paradoxes but like i think it's one thing we don't quite understand right human sees so little data and they still do fine one theory on it it could be that it you know it's it's like our other modalities um you know how do we get

you know 10 to the 14th bits into the human brain well well most of it is kind of these images and maybe a lot of what's going on inside the human brain is like you know our mental workspace involves all these these you know these these simulated images or something like that but honestly i think intellectually we have to admit that that's a weird thing that doesn't match up and you know it's one reason i'm a bit you know skeptical a kind of biological analogies i thought in terms of them like

five or six years ago but now that we actually have these models in front of us as artifacts it feels like almost all the evidence from that has been screened off by what we've seen and what we've seen are models that are much smaller than the human brain and yet yet can do a lot of the things

that humans can do and yet paradoxically require a lot more data um so maybe we'll discover something that makes it all efficient or maybe we'll understand why the discrepancy is present but at the end of the day i don't think it matters right if we keep scaling the way we are i think what's more

relevant at this point is just measuring the abilities of the model and seeing how far they are from humans, and they don't seem terribly far to me.
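For reference, here is the rough order-of-magnitude arithmetic behind the figures quoted in this exchange; every number is a loose estimate from the conversation, not a measurement.

```python
# Rough order-of-magnitude bookkeeping; all figures are loose estimates quoted above.
import math

brain_synapses        = 1e14   # human brain, ~10^14-10^15 synapses
model_parameters      = 1e11   # frontier LLMs, order of 10^11-10^12 parameters
human_words_by_18     = 3e8    # "hundreds of millions" of words by age 18
model_training_tokens = 1e12   # "hundreds of billions to trillions" of tokens
gpt4_flops_estimate   = 1e25   # training-compute estimate cited by the interviewer
brain_flops_to_20     = 1e20   # rough brain-simulation figure cited by the interviewer

def orders_of_magnitude(ratio: float) -> float:
    return round(math.log10(ratio), 1)

print("model smaller than brain by ~", orders_of_magnitude(brain_synapses / model_parameters), "OOM")
print("model sees more language data by ~", orders_of_magnitude(model_training_tokens / human_words_by_18), "OOM")
print("training-compute estimate vs. brain figure: ~", orders_of_magnitude(gpt4_flops_estimate / brain_flops_to_20), "OOM")
```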

Does this scaling picture, and the big blob of compute more generally, underemphasize the role that algorithmic progress has played? When you composed the big blob of compute doc you were presumably talking about LSTMs at that point; presumably the scaling on that would not have you at Claude 2 at this point. So are you underemphasizing the role that an improvement on the scale of the transformer could be having here, when you put it all behind the label of scaling? This big blob of compute document, which I still have not made public (I probably should, for historical reasons; I don't think it would tell anyone anything they don't know now), but when I wrote it I actually said, look, there are seven

factors that and you know i wasn't i wasn't like these are the factors but i was just like let me give some sense of the kinds of things that matter and what don't and so i wasn't thinking like these are the same you know there could be nine there could be five but like the things i said were i said

number of parameters, the scale of the model, and the compute: compute matters, quantity of data matters, quality of data matters, loss function matters, so, are you doing RL or are you doing next-word prediction; if your loss function isn't rich or doesn't incentivize the right thing, you won't get anything. So those were the key four, which I think are the core of the hypothesis. But then I said three more things. One was symmetries, which is basically: if your architecture doesn't take into account the right kinds of symmetries, it doesn't work

or it's very inefficient. So for example, convolutional neural networks take into account translational symmetry, LSTMs take into account time symmetry, but a weakness of LSTMs is that they can't attend over the whole context, so there's this structural weakness. If a model isn't structurally capable of absorbing and managing things that happened in a far enough distant past, it's kind of like the compute doesn't flow, like the spice doesn't flow; the blob has to be unencumbered, right. It's not going to work if you artificially close things off, and I think RNNs and LSTMs artificially close things off because they close you off to the distant past. So again, things need to flow freely; if they don't, it doesn't work. And then, you know,

I had a couple more things. One of them was conditioning, which is: if the thing you're optimizing with is just really numerically bad, you're going to have trouble, and this is why Adam works better than normal SGD. And I think I'm forgetting what the seventh one was, but it was similar to things like this, where if you set things up in a way that's set up to fail, or that doesn't allow the compute to work in an uninhibited way, then it won't work. And so transformers were kind of within that, even though I can't remember if the transformer paper had been published; it was around the same time as I wrote that document, it might have been just before, it might have been just after.
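As an aside, a tiny toy example of the conditioning point: on a quadratic whose curvature differs by orders of magnitude across directions, plain SGD must use a step size small enough for the stiff direction and so barely moves along the flat one, while Adam's per-parameter rescaling copes. This is an illustrative toy, not anything from the document, and it assumes PyTorch is available.

```python
# Toy ill-conditioned quadratic: loss = 0.5 * (1000*x^2 + y^2). SGD's step size is
# capped by the stiff x-direction, so y barely moves; Adam rescales per coordinate.
import torch

def loss_fn(p: torch.Tensor) -> torch.Tensor:
    return 0.5 * (1000.0 * p[0] ** 2 + p[1] ** 2)

optimizers = {
    "SGD":  lambda p: torch.optim.SGD([p], lr=5e-4),   # small enough to stay stable on x
    "Adam": lambda p: torch.optim.Adam([p], lr=1e-2),
}

for name, make_opt in optimizers.items():
    p = torch.tensor([1.0, 1.0], requires_grad=True)
    opt = make_opt(p)
    for _ in range(200):
        opt.zero_grad()
        loss_fn(p).backward()
        opt.step()
    print(f"{name:4s} after 200 steps: x={p[0].item():+.3f}, y={p[1].item():+.3f}")
```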

Hmm, it sounds like from that view the way to think about these algorithmic progresses is not as increasing the power of the blob of compute but simply as getting rid of the artificial hindrances that older architectures have. Is that a fair idea? Yeah, that's a little how I think about it. Again, if you go back to, it was like the models want to learn, yeah, like the compute wants to be free, and it's being blocked in various ways where you don't understand that it's being blocked, and so you need to free it up. Right, and let the gradients change that and fix it. Okay, on that point though, do you think that another thing on the scale of the transformer is coming down the pike to enable the next generations? I think it's possible.

I mean, people have worked on things like trying to model very long time dependencies, and there are various different ideas where I could see that we're kind of missing an efficient way of representing or dealing with something. So I think those inventions are possible. But my perspective would be that even if they don't happen, we're already on this very, very steep trajectory. We're constantly trying to discover them, as are others, but things are already on such a fast trajectory that all such an invention would do is speed it up even more, and probably not by that much, because it's already going so fast.

Is something embodied, or having an embodied version of a model, at all important in terms of getting either data or progress?

I'd think of that less in terms of a new architecture and more in terms of a loss function: the data, the environments you're exposing yourself to, end up being very different. So I think that could be important for learning some skills, although data acquisition is hard, and so things have gone through the language route, and I would guess we'll continue to go through the language route even as more becomes possible in terms of embodiment.

And then the other possibility you mentioned is RL.

Yeah, I mean, we kind of already do RL with RLHF, right?

People ask, is this alignment, is this capabilities? I always think in terms of the two snakes; they're often hard to distinguish. So we already kind of use RL in these language models, but we've used RL less in terms of getting them to take actions and do things in the world. But when you take actions over a long period of time and only understand the consequences of those actions later, RL is the typical tool we have for that. So I would guess that in terms of models taking action in the world, RL will become a thing, with all the power and all the safety issues that come with it.
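As a concrete illustration of that last point (a toy sketch of my own, not a description of how any lab trains anything): REINFORCE, the simplest policy-gradient method, assigns credit to every action taken during an episode using a reward that only arrives at the end, which is exactly the delayed-consequence structure that next-word prediction alone doesn't give you.

```python
import torch
import torch.nn as nn

# A tiny policy over 2 actions, observing a 4-dimensional state.
policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

def run_episode():
    log_probs = []
    state = torch.randn(4)
    for _ in range(10):                      # take 10 actions in sequence...
        dist = torch.distributions.Categorical(logits=policy(state))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state = torch.randn(4)               # (toy environment: random transitions)
    reward = torch.randn(()).item()          # ...and only then observe one reward
    return log_probs, reward

for step in range(100):
    log_probs, reward = run_episode()
    # Every action in the episode is reinforced in proportion to the final reward.
    loss = -reward * torch.stack(log_probs).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The environment here is pure noise, so nothing useful is actually learned; the point is only the credit-assignment mechanics, where a single end-of-episode reward scales the gradient of every action's log-probability.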

When you project out into the future, do you see the way in which these things will be integrated into productive supply chains? Do you see them talking with each other, criticizing each other, and contributing to each other's output, or does one model just one-shot the answer or the work?

Models will undertake extended tasks; that will have to be the case. We may want to limit that to some extent, because it may make some of the safety problems easier, but some of it I think will be required. As for whether our models will be talking to models or talking to humans, this again goes out of the technical realm and into the socio-cultural and economic realm, where my heuristic is always that it's very, very difficult to predict things. The scaling laws have been very predictable, but when you ask when there is going to be a commercial explosion in these models, or what form it's going to take, or whether the models are going to do things instead of humans or paired with humans, I feel like my track record on predicting these things is terrible. But looking around, I don't really see anyone whose track record is great.

You mentioned how fast progress is happening, but also the difficulties of integrating it into the existing economy, into the way things work. Do you think there will be enough time to actually have large revenues from AI products before the next model is so much better that we're in a different landscape entirely?

It depends what you mean by large. I think multiple companies are already in the 100 million to a billion dollars per year range. Will it get to the hundred billion or trillion range? That stuff is just so hard to predict, and it's not even super well defined. Right now there are companies throwing a lot of money at generative AI as customers, and I think that's the right thing for them to do, and they'll find uses for it, but it doesn't mean the uses they're finding are the best uses from day one. So even money changing hands is not quite the same thing as economic value being created.

But surely you've thought about this from the perspective of Anthropic. If these things are happening so fast, then it should be an insane valuation, right?

Even us, who have not been super focused on commercialization and more on safety: the graph goes up, and it goes up relatively quickly. So I can only imagine what's happening at the orgs for which this is their singular focus. It's certainly happening fast, but again, it's the exponential from a small base while the technology itself is moving fast. So it's kind of a race between how fast the technology is getting better and how fast it's integrated into the economy, and I think that's just a very unstable and turbulent process. Both things are going to happen fast, but if you ask me exactly how it's going to play out, exactly what other things are going to happen, I don't know, and I'm skeptical of the ability to predict it.

I'm kind of curious with regards to Anthropic specifically. You're a public benefit corporation, and rightfully so: since this is such an important technology, you want to make sure that the only thing you care about is not just shareholder value. But how do you talk to investors who are putting in hundreds of millions or billions of dollars? How do you get them to put in that amount of money without shareholder value being the main concern?

So I think the LTBT is the right thing on this. We're going to talk more about the LTBT, but some version of it has been in development since the beginning of Anthropic, even formally. So from the beginning, even as the body has changed in some ways, it was always the case that this body was going to exist. And it's unusual. Every traditional investor who invests in Anthropic looks at this. Some of them are just like, whatever, you run your company how you want. Some of them are like, oh my god, this body of, to them, random people could move Anthropic in a direction that's totally contrary to our interests. Now, there are legal limits on that, of course, but we have to have this conversation with every investor. And then it gets into a conversation of, well, what are the kinds of things we might do that would be contrary to the interests of traditional investors, and just having those conversations has helped get everyone on the same page.

I want to talk about physics, and the fact that so many of the founders and employees at Anthropic are physicists. We were talking at the beginning about the scaling laws and how the power laws of physics are something you see here. But what are the actual approaches and ways of thinking from physics that seem to have carried over so well? Is the notion of effective theory super useful? What is going on here?

I think part of it is just that physicists learn things really fast. We have generally found that if we hire someone who has a physics PhD or something, they can learn ML and contribute very quickly in most cases. And because several founders, myself, Jared Kaplan, Sam McCandlish, are physicists, we knew a lot of other physicists and so we were able to hire them. Now there are, I don't know how many exactly, maybe 30 or 40 of them here. ML is still not yet a field that has an enormous amount of depth, and so they've been able to get up to speed very quickly.

Are you concerned that there are a lot of people who would have been doing physics, or who would have gone into finance instead, and since Anthropic exists they have now been recruited into AI? You obviously care about AI safety, but maybe in the future they leave and get funded to do their own thing. Is it a concern that you're bringing more people into the ecosystem here?

Yeah, I mean, there's a broad set of side effects: we're causing GPUs to exist, there are a lot of side effects that you can't currently control, or that you just incur, if you buy into the idea that you need to build frontier models. A lot of them would have happened anyway. Finance was the hot thing 20 years ago, so physicists were doing that; now ML is the hot thing, and it's not like we caused them to do it when they had no interest previously. But again, at the margin you're kind of bidding things up. A lot of that would have happened anyway, some of it wouldn't, but it's all part of the calculus.

Do you think that Claude has conscious experience? How likely is that?

This is another of these questions that just seems very unsettled and uncertain. One thing I'll tell you is that I used to think we didn't have to worry about this at all until models were operating in rich environments, not necessarily embodied, but where they needed to have a reward function and have kind of a long-lived experience. I still think that might be the case, but the more we've looked at these language models, and particularly looked inside them to see things like induction heads, a lot of the cognitive machinery you would need for active agents seems kind of already present in the base language models. So I'm not quite as sure as I was before that we're missing enough of the things you would need. I think today's models probably aren't smart enough that we should worry about this too much, but I'm not 100% sure about that, and I do think that in a year or two this might be a very real concern.

What would change if you found out that they are conscious? Are you worried that you're pushing the negative gradient toward suffering?

Conscious, again, is one of these words that I suspect will not end up having a well-defined meaning; I suspect it's a spectrum. If we discovered that I should care about Claude's experience as much as I should care about a dog's or a monkey's, I would be kind of worried. I don't know whether their experience is positive or negative. Unsettlingly, I also wouldn't know whether any intervention we made was more likely to make Claude have a positive versus a negative experience, versus not having one at all. If there's an area that is helpful with this, it's maybe mechanistic interpretability, because I think of it as neuroscience for models, and so it's possible that we could shed some light on this. Although it's not a straightforward factual question; it kind of depends on what we mean and what we value.

We talked about this initially, but I want to get more specific. We talked about how, now that you're seeing these capabilities ramp up within the human spectrum, you think the human spectrum is wider than we thought. But more specifically, how is the way you think about human intelligence different now that you're seeing these marginally useful abilities emerge? How does that change your picture of what intelligence is?

I think for me the big realization about what intelligence is came with the blob of compute thing: it's not that there are all these separate modules, all this complexity. Rich Sutton called it the bitter lesson; it has many names, it's been called the scaling hypothesis. The first few people who figured it out came around 2017. You could go further back: I think Shane Legg was maybe the first person who really knew it, maybe Ray Kurzweil, although in a very vague way. But the number of people who understood it went up a lot around 2014 to 2017. I think that was the big realization: how did intelligence evolve? Well, if you don't need very specific conditions to create it, if you can create it just from the right kind of gradient and loss signal, then of course it's not so mysterious how it all happened. It had this click of scientific understanding. In terms of watching what the models can do, how has it changed my view of human intelligence? I wish I had something more intelligent to say on that.

One thing that's been surprising is that I thought things might click into place a little more than they do. I thought different cognitive abilities might all be connected, that there was more of one secret behind them. But the model just learns various things at different times: it can be very good at coding, but it can't quite prove the prime number theorem yet. I guess it's a little bit the same for humans, although the juxtaposition of the things it can and can't do is weird. I guess the main lesson is that theories of intelligence, or of how intelligence works, again, a lot of these words just kind of dissolve into a continuum; they dematerialize. I think less in terms of intelligence and more in terms of what we see in front of us.

Yeah, two things are really surprising to me. One is how discrete these different parts of intelligence that contribute to the loss are, rather than there being one reasoning circuit or one general intelligence. And the other thing from talking with you that is surprising or interesting is that, many years from now, it will be one of those things where, looking back, people ask: why wasn't this obvious to you? If you were seeing these smooth scaling curves, why was there a time when you weren't completely convinced?

So, you've been less public than the CEOs of other AI companies. You're not posting on Twitter, you're not doing a lot of podcasts, except for this one. What gives? Why are you off the radar?

Yeah, I aspire to this and I'm proud of it. If people think of me as boring and low-profile, that's actually kind of what I want. I've just seen a number of cases, a number of people I've worked with, where, you could say Twitter, though I mean a broader thing, attaching your incentives very strongly to the approval or cheering of a crowd, I think that can destroy your mind, and in some cases it can destroy your soul. So I've deliberately tried to be a little bit low-profile, because I want to defend my ability to think about things intellectually in a way that's different from other people and isn't tinged by the approval of other people. I've seen cases of folks who are deep learning skeptics and become known as deep learning skeptics on Twitter, and then, even as it starts to become clear to me that they've sort of changed their mind, that's their thing on Twitter and they can't change their Twitter persona, and so forth and so on.

I also don't really like the trend of personalizing companies, the whole cage match between CEOs approach. I think it distracts people from the actual merits and concerns of the company in question. I kind of want people to judge the nameless bureaucratic institution; I want people to think in terms of the nameless bureaucratic institution and its incentives more than they think in terms of me. Everyone wants a friendly face, but actually I think friendly faces can be misleading.

Okay, well, in that case this will be a misleading interview, because this has been a lot of fun. It's been a blast talking to you.

Indeed, yeah, this has been a blast.

I'm just super glad you came on the podcast, and I hope people enjoyed it.

Thanks for having me.

Hey everybody, I hope you enjoyed that episode. As always, the most helpful thing you can do is just share the podcast: send it to people you think might enjoy it, put it on Twitter, in your group chats, etc. Just blitz the world. Appreciate you listening. I'll see you next time. Cheers.

This transcript was generated by Metacast using AI and may contain inaccuracies.