
John Schulman (OpenAI Cofounder) - Reasoning, RLHF, & Plan for 2027 AGI

May 15, 2024 · 2 hr 37 min

Episode description

Chatted with John Schulman (cofounded OpenAI and led ChatGPT creation) on how post-training tames the shoggoth, and the nature of the progress to come...

Watch on YouTube. Listen on Apple Podcasts, Spotify, or any other podcast platform. Read the full transcript here. Follow me on Twitter for updates on future episodes.

Timestamps

(00:00:00) - Pre-training, post-training, and future capabilities

(00:16:57) - Plan for AGI 2025

(00:29:19) - Teaching models to reason

(00:40:50) - The Road to ChatGPT

(00:52:13) - What makes for a good RL researcher?

(01:00:58) - Keeping humans in the loop

(01:15:15) - State of research, plateaus, and moats

Sponsors

If you’re interested in advertising on the podcast, fill out this form.

* Your DNA shapes everything about you. Want to know how? Take 10% off our Premium DNA kit with code DWARKESH at mynucleus.com.

* CommandBar is an AI user assistant that any software product can embed to non-annoyingly assist, support, and unleash their users. Used by forward-thinking CX, product, growth, and marketing teams. Learn more at commandbar.com.



Get full access to Dwarkesh Podcast at www.dwarkesh.com/subscribe

Transcript

Today I have the pleasure of speaking with John Schulman, who is one of the co-founders of OpenAI and leads the post-training team there. He also led the creation of ChatGPT and is the author of many of the most important and widely cited papers in AI and RL, including PPO and many others. So John, really excited to chat with you. Thanks for coming on the podcast.

Thanks for having me on the podcast. I'm a big fan.

Thank you for saying that.

So the first question I had is, we have these distinctions between pre-training and post-training. Beyond what is actually happening in terms of loss function and training regimes, I'm just curious, taking a step back conceptually, like what kind of thing is pre-training creating? What does post-training do on top of that? In pre-training, you're basically training to imitate all of the content on the internet or on the web, including websites and code and so forth.

So you get a model that can basically generate content that looks like random web pages from the internet. And the model is also trained to maximize likelihood, where it has to put a probability on everything. So the objective is basically predicting the next token given the previous tokens. Tokens are like words or parts of words. And since the model has to put a probability on everything, and we're training it to maximize log probability, it ends up being very calibrated.
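
(To make the objective John describes concrete, here is a minimal sketch of a next-token prediction loss in PyTorch, with a hypothetical model that maps token ids to next-token logits. Maximizing the log probability of the observed next token is the same as minimizing this cross-entropy, which is what pushes the model toward calibrated probabilities.)

```python
# Minimal sketch of the pre-training objective: predict the next token.
# "model" is a hypothetical stand-in, not OpenAI's actual training code.
import torch
import torch.nn.functional as F

def pretraining_loss(model, tokens):
    """tokens: LongTensor of shape (batch, seq_len) of token ids."""
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    logits = model(inputs)  # (batch, seq_len - 1, vocab_size)
    # Maximizing log probability of each next token == minimizing cross-entropy.
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```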

So it can not only generate all of the content of the web, it can also assign probabilities to everything. So the base model can effectively take on all these different personas or generate all these different kinds of content. And then when we do post-training, we're usually targeting a narrower range of behavior where we basically want the model to behave like this kind of chat assistant.

And it's a more specific persona where it's trying to be helpful. It's not trying to imitate a person. It's answering your questions or doing your tasks. And we're optimizing on a different objective, which is more about producing outputs that humans will like and find useful as opposed to just trying to imitate this raw content from the web. Yeah. Okay. I think maybe I should take a step back and ask. Right now we have these models that are pretty good at acting as chatbots.

Just taking a step back from how these models are trained currently: what do you see them being able to do? What does the progress look like if you carry this forward for the next five years?

Oh yeah, five years. I think the models will get quite a bit better.

But in what way?

In the next five years...

So, I mean, I think even in one or two years we'll find that you can use them for much more involved tasks than they can do now. For example, right now you could imagine having the models carry out a whole coding project, instead of maybe giving you one suggestion on how to write a function. You could imagine giving the model high-level instructions on what to code up, and it'll go and write many files, test it, look at the output, and iterate on that a bit. So, just much more complex tasks.

And fundamentally, is the unlock that it can act coherently for long enough to write multiple files of code? What has changed between now and then?

Yeah, I would say this will come from some combination of just training the models to do harder tasks like this. Right now, most of the training data looks more like doing single steps at a time, and I would expect us to do more training of the models to carry out these longer projects. Any kind of training at carrying out these long projects, like doing RL to learn how to do these tasks, however you do it, whether you're supervising the final output or supervising each step, is going to make them a lot better. And since the whole area is pretty new, I'd say there's just a lot of low-hanging fruit.

Interesting.

So I'd say that's one thing. Also, I would expect that as the models get better, they're just better at recovering from errors, or better at dealing with edge cases; when things go wrong, they know how to recover. So the models will be more sample efficient. You don't have to collect a ton of data to teach them how to get back on track; just a little bit of data, or their generalization from other abilities, will let them get back on track, whereas current models might just get stuck and get lost.

I should actually understand more specifically how the generalization helps you get back on track. Can you say more about that? I'm not sure I get how those two concepts are connected.

Right, they're not directly connected. I would say you usually have a little bit of data that does everything. If you've collected a diverse data set, you're going to get a little bit of everything in it. And if you have models that generalize really well, then even if there are just a couple of examples of getting back on track, or maybe there are examples of getting back on track in pre-training, the model will be able to generalize from those other things it's seen to the current situation.

So I think if you have models that are weaker, you might be able to get them to do almost anything with enough data, but you might have to put a lot of effort into a particular domain or skill, whereas a stronger model might just do the right thing without any training data or any effort.

I have some intuition that right now these models can maybe act coherently for five minutes. We want them to be able to do tasks that for a human would take an hour, then a week, then a month, and so forth. To get to each of these benchmarks, is it going to take 10x more compute each time, analogous to the current scaling laws for pre-training? Or is it going to be a much more streamlined process, because once you get to the point where the model is already more sample efficient, you can just jump to years of carrying out tasks or something?

Yeah, I would say at a high level I would agree that longer-horizon tasks are going to require more model intelligence to do well and are going to be more expensive to train for.

I'm not sure I would expect there to be a really clean scaling law unless you set it up very carefully or design the experiment in a certain way, because there might end up being some phase transitions where, once you get to a certain level, you can deal with much longer tasks. For example, when people plan at different time scales, I'm not sure they use completely different mechanisms.

I think we use the same mental machinery whether we're thinking about one month from now, one year from now, or 100 years from now. So we're not actually doing some kind of reinforcement learning where we need to worry about a discount factor that covers that time scale and so forth.
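
(For readers unfamiliar with the discount factor John mentions: in standard RL, a reward t steps in the future is weighted by gamma^t, so a fixed gamma implies an effective planning horizon of roughly 1/(1 - gamma). A quick back-of-the-envelope illustration with made-up values, which is why a naive RL framing would seem to need a different discount for each time scale:)

```python
# Effective planning horizon implied by a discount factor gamma:
# rewards t steps ahead are weighted by gamma**t, which has decayed
# substantially after about 1 / (1 - gamma) steps.
for gamma in (0.9, 0.99, 0.999):
    horizon = 1.0 / (1.0 - gamma)
    print(f"gamma={gamma}: effective horizon of roughly {horizon:.0f} steps")
```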

I think using language, you can describe all of these different time scales, and then you can do things like plan. In the moment, you can try to make progress towards your goal, whether it's a month away or 10 years away. I might expect the same out of models: there's some kind of, I don't know if it's a phase transition, but some set of capabilities that works at multiple scales.

Okay, so correct me if this is wrong, but it seems like that implies that right now we have models that are, on a per-token basis, pretty smart; they might be as smart as humans on a per-token basis. And the thing that prevents them from being as useful as they could be is that, five minutes from now, they won't still be writing your code in a way that's coherent and aligned with the broader goals you have for a project or something.

If it's the case that once you start this long-horizon RL training regime, it immediately unlocks the ability to be coherent for longer periods of time, should we be predicting something that is human-level as soon as that regime is unlocked? And if not, what is remaining after you can plan for a year and execute projects that take that long?

Yeah, it's not totally clear what we're going to see once we get into that regime, or how fast progress will be; that's still uncertain. I wouldn't expect everything to be immediately solved by doing any training like this.

I would think there'll be other miscellaneous deficits the models have that cause them to get stuck, or not make progress, or make worse decisions than humans. So I wouldn't expect this one little thing to unlock all capabilities, but it's not clear. Some improvement in the ability to do long-horizon tasks might go quite far.

Would you say it's plausible, or does it seem quite likely, that there will be other reasons why there might be bottlenecks?

I'm also kind of curious what the nature of the bottlenecks would be. The model has all these representations from pre-training, and now it can act coherently for a long period of time because of long-horizon RL. What's remaining?

Yeah, maybe there's some other experience that human experts bring to different tasks, like having some taste, or dealing with ambiguity better. So I could imagine that if we want to do something like research, those kinds of considerations come into play.

And obviously there are going to be mundane limitations around the affordances of the model, like whether it can use UIs, or act in the physical world, or have access to things. There might be a lot of mundane barriers that are probably not going to last that long, but would initially slow down progress.

Will the websites that are designed for these AIs, once they're much more multimodal, or at least trained on more multimodal data, be in any way different from the ones we have for humans? Compensating for their strengths and weaknesses, how would the UIs they need look different from the current UIs we have for humans?

Yeah, that's an interesting question. I would expect that models will be able to use websites that are designed for humans just by using vision, once the vision capabilities get a bit better, so there wouldn't be an immediate need to change them. On the other hand, websites that are going to benefit a lot from AIs being able to use them will probably want to be designed as better UXes for AIs.

I'm not sure exactly what that would mean, but assuming our models are still better in text mode than at reading text out of images, you'd probably want a good text-based representation for the models, and also just a good indication of what all the things are that can be interacted with. But I wouldn't expect the web to get totally redesigned to have APIs everywhere, because I expect that we can get models to use the same kinds of UIs that humans use. I guess that's been the big lesson of language models, right: they can act with the same affordances that humans have.
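
(Purely as an illustration of what a good text-based representation plus an indication of what can be interacted with might look like; this is a made-up format, not an existing standard or anything OpenAI has described:)

```python
# Hypothetical text-first description of a page that an agent could act on:
# the visible text plus an explicit list of interactive elements.
page_for_agent = {
    "url": "https://example.com/checkout",
    "text": "Shipping address\nPayment method\nOrder total: $42.10",
    "interactive_elements": [
        {"id": "address-input", "kind": "text_field", "label": "Shipping address"},
        {"id": "pay-select",    "kind": "dropdown",   "label": "Payment method"},
        {"id": "place-order",   "kind": "button",     "label": "Place order"},
    ],
}

def render_for_model(page):
    """Flatten the structured page into plain text the model can read."""
    lines = [page["text"], "", "You can interact with:"]
    for el in page["interactive_elements"]:
        lines.append(f"- [{el['kind']}] {el['label']} (id={el['id']})")
    return "\n".join(lines)

print(render_for_model(page_for_agent))
```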

The point you made earlier, that this process could be more sample efficient because the model could generalize from its pre-training experience about how to get unstuck in different scenarios: I'm curious what the strongest evidence of this kind of generalization and transfer you've seen is. Because the big question about the future abilities of these models seems to be how much generalization is actually happening. Is there something that feels really compelling to you, where the model learned something through generalization that you wouldn't have expected it to learn?

There have definitely been some interesting instances of generalization in post-training. One well-known phenomenon is that if you do all your fine-tuning with English data, the model will automatically also behave well in other languages. So if you train the assistant on English data, it'll also do something reasonable in Spanish, say. Sometimes you might get the wrong behavior in terms of whether it replies in English or in Spanish, but usually you get the right behavior there as well: it responds in Spanish to Spanish queries. So that's one kind of interesting instance of generalization. You just latch onto the right helpful persona, and then you automatically do the right thing in different languages. We've seen some version of this with multimodal data, where if you do text-only fine-tuning, you also get reasonable behavior with images.

Early on in ChatGPT, we were trying to fix some issues with the model understanding its own limitations. Early versions of the model would think they could send you an email or call you an Uber or something; the model would try to play the assistant and would say, oh yeah, of course I sent that email, and obviously it didn't. So we started collecting some data to fix those problems, and we found that a tiny amount of data did the trick, even when you mix it together with everything else. I don't remember exactly how many examples, but something like 30, a pretty small number of examples, showing this general behavior of explaining that the model doesn't have a given capability, and that generalized pretty well to all sorts of capabilities we didn't train for.
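
(A rough sketch of the data-mixing idea: a tiny slice of targeted limitation-awareness examples folded into a much larger post-training blend. The numbers and example contents here are invented for illustration:)

```python
import random

# Hypothetical post-training blend: ~30 "I can't actually do that" demonstrations
# mixed into a much larger supervised fine-tuning dataset.
main_examples = [
    {"prompt": f"task {i}", "response": f"answer {i}"} for i in range(100_000)
]
limitation_examples = [
    {"prompt": "Can you email this summary to my boss?",
     "response": "I can't send emails, but I can draft one for you to send."},
    # ...roughly 30 examples in the same spirit, covering a few capabilities
]

blend = main_examples + limitation_examples
random.shuffle(blend)
# Fine-tune on the shuffled blend; the small slice can generalize to
# capabilities that were never explicitly covered.
```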

I still want to go back to this, because I'm not sure I understood. If you have this model that is trained to be coherent for longer periods of time, does that imply that, unless there are these other bottlenecks, which there may or may not be, by next year you could have models that are potentially human-level? In the sense that interacting with them is almost as good as interacting with a human colleague: you can tell them to go do stuff and they'll go get it done. What seems wrong with that picture of the capabilities you think might be possible?

Yeah, it's hard to say exactly what the deficit will be. I would say that when you talk to the models today, they have various weaknesses besides long-term coherence, also in terms of really thinking hard about things, or paying attention to what you asked them. I wouldn't expect just improving the coherence a little bit to be all it takes to get to AGI, but I guess I wouldn't be able to articulate exactly what the main weaknesses are that will stop them from being a fully functional colleague.

It seems like you should be planning for the possibility that we have AGI very soon.

Yeah, I think that would be reasonable.

So what's the plan? If there are no other bottlenecks and next year or something you've got AGI, what's the plan?

Well, I would say that if AGI came way sooner than expected, we would definitely want to be careful about it. We might want to slow down a little bit on training and deployment until we're pretty sure we can deal with it safely, and we have a pretty good handle on what it's going to do and what it can do. We'd have to be very careful if it happened way sooner than expected, because our understanding is still rudimentary in a lot of ways.

What would being careful mean? Because presumably you're already careful, right? You do these evaluations before you release.

Yeah, I would say maybe not training the even smarter version, or being really careful when you do train it: making sure it's properly sandboxed and everything. Maybe not deploying it at scale, or being careful about what scale you deploy it at.

Okay, so let's just play with the scenario: it happens next year, you're not training a smarter system, and you're deploying somewhat in a measured way. So you wait to deploy a little bit, and now other companies have similar capabilities. What happens next? You've waited to deploy. What are you waiting for? What is every company doing in this scenario?

Yeah, the game theory is a little tough to think through. So first of all, I don't think this can happen next year, but it's still useful to have the conversation. Maybe it's more like two or three years instead.

Two or three years is still pretty soon.

Yeah, still pretty soon. I do think you probably need some coordination. Everyone needs to agree on some reasonable limits to deployment, or to further training, for this to work. Otherwise you have the race dynamics where everyone's trying to stay ahead, and that might require compromising on safety. So I think you would probably need some coordination among the larger entities that are doing this kind of training.

And so you're coordinating to, I guess, pause deployment until... until what, exactly?

Like pause either further training, pause deployment, or avoid certain types of training that we think might be riskier. Just setting up some reasonable rules for what everyone should do, having everyone somewhat limit these things.

But limit to what end? Because I guess at some point the potential energy within this intelligence is going to show itself. What is the plan? Suppose in two years we get AGI, everybody's freaking out, and the AI companies have paused.

And now what? What would be the plan, to wait until...?

Yeah, I don't have a good answer to that. I would say that if everyone is going to coordinate like that, that would be an okay scenario, a pretty good scenario even, because building these models is very capital intensive and there are a lot of complex pieces, so it's not like everyone's going to go and recreate this stuff at home. Given the relatively small number of entities who could train the largest models, it does seem possible to coordinate. I'm not sure how you would maintain this equilibrium for a long period of time, but I think if we got to that point, we would be in an okay position.

I guess I'm curious because I'm not sure what happens next. Fundamentally the problem, or the benefit, is that you can push it to a server and now we've got a bunch of intelligences, or they can push themselves to a server.

Now we've got everybody coordinated, but I'm not sure what we do next in this world, or why this is a good outcome.

Yeah, I would say if we had everyone reasonably coordinated, and we felt like we had solved the technical problems around alignment well enough to be able to deploy really smart AIs that can act as an extension of people's will, while also preventing them from being misused in some way that would cause a catastrophe, then that would be great. We could go ahead and safely deploy these systems, and it would usher in a lot of prosperity and a new, much more rapid phase of scientific advancement and so forth. So I think that's what the good scenario would look like.

Okay, that makes sense. But I'm curious how you would know, in a couple of years. Even in the best-case scenario, where all these actors agreed to pause until we figured out that we're building aligned systems that are not themselves going to attempt to take over, and are not going to enable somebody else to do that, what would proof of that look like? What would evidence of that look like?

Well, I would say if we can deploy systems incrementally that are successively smarter than the ones before, then I think that's safer. So I hope the way things play out is not the scenario where everyone has to coordinate and lock things down and then carefully release things, because that would potentially lead to this big build-up in potential energy. I would rather have a scenario where we're just continually releasing things that are a little better than what came before, while making sure we're confident that each diff is right, improving the safety and alignment in correspondence with the improvement in capability. And if things started to look a little bit scary, then we would be able to slow things down. So that's what I would hope for.

If there's more of a discontinuous jump, then the question is how you know whether the thing you've got is safe to release. I can't give a generic answer, as much as I would want to, but the type of thing you might want to do to make it more acceptable would be a lot of testing, like simulated deployment; red teaming of sorts. You'd want to do that in a way that you feel is much less favorable, or much more likely to fail, than the deployment you're planning to do in the real world.

You'd want to have a really good monitoring system, so that if something does start to go wrong with the deployed system, you feel like it's going to be detectable immediately. Maybe you've got something watching over the deployed AIs, looking at what they're doing and watching for signs of trouble.

So I would say you'd want some defense in depth: some combination of the model itself seeming to be really well behaved, with an impeccable moral compass and everything, where you're pretty confident it's extremely resistant to any kind of takeover attempt or severe misuse, and then really good monitoring on top of that, so you could detect any kind of trouble.

What are you keeping track of while you're doing long-horizon RL, or when you eventually start doing it, so that you could notice this sort of discontinuous jump before you deployed these systems broadly?

I would say you would want to have a lot of evals that you're running during the training process.

What specifically? How would you notice something like that? And does it make sense to do long-horizon RL training knowing that this is something that could happen, or is it just a very low possibility? How do you think about this?

You'd want to be pretty careful when you do this kind of training if you see a lot of potentially scary capabilities, or if those seem close. I would say it's not something we have to be scared of right now, because right now it's hard to get the models to do anything coherent. But if they started to get really good, I think we would have to take some of these questions seriously, and we would want a lot of evals that test them for misbehavior. I guess that's for the alignment of the models: we want to check that they're not going to turn against us or something. But you might also want to look for discontinuous jumps in capabilities, so you'd want lots of evals for the capabilities of the models.

I guess you'd also want to make sure that whatever you're training on doesn't give the model any reason to turn against you, which itself doesn't seem like the hardest thing to do.

I mean, the way we train them with RLHF, even though the models are very smart, does feel very safe, because the model is just trying to produce a message that is pleasing to a human, and it has no concern about anything else in the world other than whether the text it produces is approved of. Obviously, if you were doing something where the model is carrying out a long sequence of actions involving tools and everything, then it might have some incentive to do a lot of wacky things that wouldn't make sense to a human in the process of producing its final result. But I guess it wouldn't necessarily have an incentive to do anything other than produce a very high-quality output at the end.

So you have these old points about instrumental convergence, where the model is going to want to take over the world so it can produce this awesome piece of code at the end. If you ask it to write you a Flask app, it'll be like, oh yeah, first I need to take over the world, and then... I don't know. At a certain point, it's a little hard to imagine why, for some fairly well-specified task like that, you would want to first take over the world. But of course, if you had a task like "make money," then maybe that would lead to some nefarious behavior as an instrumental goal.

Yeah. Okay, so before we get back to that, let's step back and talk about today's RLHF systems, though I do want to follow up on that point; it's kind of interesting. So for today's RLHF, the way in which it influences these models: how would you characterize it in terms of human psychology? Is it a drive? Is it a goal? Is it an impulse? Psychologically, what kind of thing is being changed, and in what ways? Not just the persona of a chatbot, but things like: don't talk that way, talk this other way, or don't produce those kinds of outputs.

Yeah, I would say there are probably some analogies with a drive or a goal in humans, in that you're trying to steer towards a certain set of states rather than some other states. Though I would think that our concept of a drive or a goal has other elements, like the feeling of satisfaction you get from achieving it, and those things might have more to do with the learning algorithm than with what the model does at runtime, when you just have a fixed model. So there are probably some analogies, though I don't know exactly how close they are, but to some extent the models do have drives and goals in some meaningful way. And in the case of RLHF, where you're trying to maximize human approval as measured by a reward model, the model is just trying to produce something that people are going to like and judge as correct.

I've heard two ideas, at least publicly, about using RL to get better at reasoning, and I'm curious which you think is more promising. One is that the model outputs a bunch of potential trains of thought, learns to follow the one that leads to the correct answer, and is trained on that before deployment. The other is that you use a bunch of compute to do inference in deployment, which involves the model talking to itself while it's deployed. Which one do you expect it to be closer to when it's really good at reasoning? Is it doing a bunch of inference at deployment, or is it that you've trained it to do all of that?

Well, I would say you could define reasoning as tasks that require some kind of computation at test time, or maybe some kind of deduction. So by definition, reasoning would be tasks that require some step-by-step computation at test time. On the other hand, I would also expect you to gain a lot from doing some kind of training-time computation, or practice at training time. So I would think you get the best results by combining these two things.
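
(A toy sketch of how the two ideas John mentions can be combined: at training time, sample chains of thought and keep the ones that reach a known correct answer for further fine-tuning; at test time, spend extra compute by sampling several chains and taking a majority vote over their answers. The generate and final_answer helpers, and the majority-vote choice, are assumptions made for illustration, not a description of OpenAI's approach:)

```python
def generate(model, prompt):
    """Hypothetical: sample one chain of thought that ends in an answer."""
    return model(prompt)

def final_answer(chain):
    """Hypothetical: pull the final answer out of a chain of thought."""
    return chain.split("Answer:")[-1].strip()

# Training-time computation: keep sampled chains that reach the known answer,
# then fine-tune on them (rejection sampling over reasoning traces).
def collect_training_chains(model, problems, samples_per_problem=16):
    kept = []
    for prompt, correct in problems:
        for _ in range(samples_per_problem):
            chain = generate(model, prompt)
            if final_answer(chain) == correct:
                kept.append((prompt, chain))
    return kept  # fine-tune the model on these (prompt, chain) pairs

# Test-time computation: sample several chains at deployment and take a
# majority vote over their final answers.
def answer_with_test_time_compute(model, prompt, n=8):
    answers = [final_answer(generate(model, prompt)) for _ in range(n)]
    return max(set(answers), key=answers.count)
```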

So, my uncle had prostate cancer and I wanted to know my own risk, and I got a 23andMe test, but it was mostly useless. I mean, "what are your odds of liking chocolate" is not what I was looking for. So I exported my data onto Nucleus Genomics, and immediately I got my risk profile for almost two dozen diseases, including prostate cancer. It turns out my risk is higher than 97% of people with my ancestry, which is a very useful thing to know, because now I know to get screened early. Ironically, tests like 23andMe don't even look at the variants which have the largest impact, including for prostate cancer. Many people don't know this, but 23andMe looks at less than 0.1% of your DNA. That's why I preordered Nucleus Premium whole-genome sequencing. It's the only clinical-grade test that reads 100% of your DNA. I've spent a lot of time digging into this company, and I think it will be a big change in what we can get out of a genetic test. So if you want to live a long and healthy life, you can preorder Nucleus Premium at mynucleus.com. All right, back to John.

Right now, you have these two ways in which the model learns. One is in training, whether pre-training or post-training, but most of the compute in training is spent on pre-training, just skimming over trillions of tokens' worth of information, which, if a human were subjected to it, would leave them totally confused, right? It's not a very efficient way to learn. The other way is in-context learning, which is more sample efficient, but it's destroyed with each instance. I'm curious whether you think there's a path for something in between, where it's not destroyed with each instance, but it's also not as frivolous as just seeing trillions of tokens; where it's more deliberate and active.

Yeah, so do you mean models having some kind of medium-term memory? Too much to fit in context, but a much smaller scale than pre-training?

I'm not sure if it's memory; it might be memory, I don't have the right concept. But certainly when I'm trying to prepare for this conversation, it feels like I think about what I should understand, so I look it up, read it carefully, and maybe think about it as I'm reading it. I'm not sure what that naturally corresponds to in terms of models, but I'm curious what it would look like.

I see, so it's not just memory, it's also somewhat like specializing to a certain task, or putting a lot of effort into some particular project.

I'm not sure it's specialization so much as: I don't understand this part, so let me look into this part deeper; I already understand that part. Specializing relative to your existing knowledge base.

I see, so it's not just about, I don't know, training on a bunch of sources that are relevant, or fine-tuning on some special domain. It's also about developing some knowledge through your own reasoning, and using some sort of introspection and self-knowledge to figure out what you need to learn.

Yeah.

Yeah, I would say that does feel like something that's missing from today's systems. People haven't really pushed too hard on this middle ground between, on one hand, large-scale training, where you produce this snapshot model that's supposed to do everything, a deployed model, and, on the other hand, in-context learning. I think part of that is that we've just been increasing context length so much that there hasn't been an incentive for it. If you can go to 100,000 or a million tokens of context, that's actually quite a lot, and it's not actually the bottleneck in a lot of cases. But I agree that you would probably also want to supplement that with some kind of fine-tuning; the capabilities you get from fine-tuning and from in-context learning are probably somewhat complementary. So I would expect us to want to build systems that do some kind of online learning and also have some of these cognitive skills, like introspecting on their own knowledge and seeking out new knowledge that fills in the holes.

Is this all happening at the same time? Is it a new training regime where all these things can happen at once, whether it's the long-horizon training or this kind of training? Are they separate, or is it just that the model is smart enough that it can both introspect and act on longer horizons, so you can get adequate reward on long-horizon tasks?

Yeah, I would say if you're doing some kind of long-horizon task, you're learning while you do the task, right? The only way to do something that involves a lot of steps is to have learning and memory that gets updated during the task. So there's a continuum between short-term and long-term memory. I would expect the need for this capability to start becoming clear as we look at long-horizon tasks more. To some extent, just putting a lot of stuff into context will take you pretty far, because we have really long context now, but you'd probably also want things like fine-tuning. As for introspection and the ability to do active learning, that might automatically fall out of the models' ability to know what they know. Models have some calibration regarding what they know, and that's why models don't hallucinate that badly: they have some understanding of their own limitations. So I think that same kind of ability could be used for something like active learning.

There are all these complicated RL procedures, many of which you've pioneered. How many of them will be relevant when you get to the point where the model itself is smart enough that it can act in an environment and interact with it online in a stable way? Is the path for progress going to be more straightforward than the kinds of solutions that were required for RL in the past?

Well, I think policy gradient algorithms are not the most sample-efficient algorithms, so that's probably not what you want to use at test time if you want to learn really fast, though who knows, maybe it's not that bad. I think something like motor learning in animals is probably something like a policy gradient algorithm. For example, when you're learning how to shoot baskets, it takes maybe thousands of tries to get more accurate, and there's probably something like a policy gradient algorithm underneath. But that's not going to be the fastest way to learn if you have a model trying to do a project or some kind of task. So I would think we want to rely more on in-context learning, where you effectively have a learned algorithm: you've learned how to explore, you've learned how to try all the possibilities exhaustively instead of doing the same thing over and over again and making the same mistake. So I would say we'll be able to do things that look more like learned search algorithms, and that'll be the kind of thing that gets used within a particular task.
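
(For reference, here is a minimal REINFORCE-style policy gradient update of the kind John is contrasting with faster in-context learning; this is a generic textbook sketch, not a specific OpenAI algorithm like PPO:)

```python
import torch

def reinforce_step(policy_optimizer, episodes):
    """episodes: list of (log_probs, reward) pairs, where log_probs is a tensor
    of log-probabilities (with gradients) of the actions taken in one episode."""
    rewards = torch.tensor([float(r) for _, r in episodes])
    baseline = rewards.mean()  # simple variance-reduction baseline
    loss = 0.0
    for log_probs, reward in episodes:
        # Raise the probability of actions from episodes that beat the baseline,
        # lower it for episodes that did worse.
        loss = loss - log_probs.sum() * (reward - baseline)
    loss = loss / len(episodes)
    policy_optimizer.zero_grad()
    loss.backward()
    policy_optimizer.step()
```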

Interesting. All right, I want to step back and ask about your own history at OpenAI. You led the creation of ChatGPT. At what point did you realize, first of all, that these LLMs were the path to go down, and then that a chatbot, or some way to instruct them, would be a useful thing to build? Just walk me through the whole lineage, from when this became your main focus, and what the process was like.

Yeah, so before ChatGPT, OpenAI had these instruction-following models. The idea there was that we had base models, and people could prompt them in elaborate ways, but they were also kind of hard to prompt. They basically do autocomplete, so you have to set up a very good prompt with some examples. So people at OpenAI were working on taking the base models and making them easier to prompt, so that if you just wrote a question, it would answer the question instead of giving you more questions or something. So we had these instruction-following models, which were kind of like base models but a little easier to use, and those were the original ones deployed in the API after GPT-3; those were the next generation of models.

At the same time, there were definitely a lot of people thinking about chat. Google had some papers: they had LaMDA and, earlier, Meena. So they had these chatbots. It was more like a base model that was really specialized to the task of chat, really good at chat, and, at least looking at the examples from the paper, it was more used for fun applications, where the model would take on some persona and pretend to be that persona. It was not so functional, like "help me refactor my code."

So yeah, there were definitely people thinking about chat. I had worked on a project before that called WebGPT, which was more about doing question answering with the help of web browsing and retrieval. When you do question answering, it really wants to be in a chat, because you always want to ask follow-up questions, or sometimes the model should ask a clarifying question because the question is ambiguous. So it was kind of clear after we did the first version of that that the next version should be conversational.

So anyway, we started working on a conversational chat assistant. This was built on top of GPT-3.5, which was done training at the beginning of 2022, and that model was quite good at language and code. We quickly realized that it was actually quite good at coding help, and that was one of the things we were excited about. We worked on that for most of the year. We had browsing as another feature in it, though we ended up deemphasizing that later on, because the model's internal knowledge was so good that browsing wasn't the most interesting thing about it.

Then we had it up for beta testing with friends and family for a while, and we were thinking about doing a public release. But at that time, GPT-4 finished training, in August of that year, and the flagship RL effort at OpenAI was the instruction-following effort, because those were the models being deployed into production. So the first fine-tunes of GPT-4 used that whole stack. Those models were really good, and everyone got really excited about them after seeing the instruct-fine-tuned GPT-4s. They were really good; they would occasionally give you amazing outputs, but the model was also clearly pretty unreliable. It would sometimes hallucinate a lot, and it would sometimes give you pretty unhinged outputs. So it was clearly not quite ready for prime time, but it was obviously very good.

I guess people forgot about chat for a little while after that, because they were excited about this alternative branch. But then we pushed it further, and we ended up mixing together all the data sets, the instruct and the chat data, to try to get something that was the best of both worlds. The chat models were clearly easier to use; they sort of automatically had much more sensible behavior in terms of the model knowing its own limitations. That was actually one of the things I got excited about as we were developing it: I realized that a lot of the things people thought were flaws in language models, like blatantly hallucinating, could be, not completely fixed, but you could make a lot of progress with pretty straightforward methods.

Oh yeah, and the other thing about chat was that with these instruct models, the task of "complete this text, but in a nice way or in a helpful way" is a pretty poorly defined task. I think that task is confusing both for the model and for the human who's supposed to do the data labeling, whereas for chat, I think people had an intuitive sense of what a helpful robot should be like. So it was just much easier for people to get the idea of what the model was supposed to do, and as a result, the model had a much more coherent personality, and it was much easier to get pretty sensible behavior robustly.

Interesting. Is it the case that anybody could have made ChatGPT using your publicly available fine-tuning API?

Not exactly. I mean, they could have... I don't remember the status of which models were available for fine-tuning. Assuming we had 3.5 available for fine-tuning at the time, you could have made something decently close, but I don't think you would have been able to do it with just one iteration of fine-tuning, where you have purely human-written data and you fine-tune on that. I think you'd want to do several iterations. If you're not going to do RL, which we did, you'd want to do some kind of iterative supervised fine-tuning, where you have humans edit the model-generated outputs. If you train on human-generated data, even if it's really high quality, it's just hard for a model to fit that data perfectly, because it might not be something the model is capable of outputting. So you need to do something iterative that looks a little bit more like RL. I think if you had done that, you could have gotten something pretty close, but it would have been kind of nontrivial.
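
(A sketch of the iterative supervised fine-tuning loop John describes: generate with the current model, have humans edit the outputs, fine-tune on the edits, and repeat, so the training targets stay close to text the model can actually produce. The helpers here are placeholders, not a real API:)

```python
def human_edit(draft: str) -> str:
    """Placeholder for a human annotator improving the model's draft."""
    return draft  # in practice, an edited and corrected version comes back

def finetune(model, pairs):
    """Placeholder for one round of supervised fine-tuning on (prompt, target) pairs."""
    return model

def iterative_sft(model, prompts, rounds=3):
    """Each round trains on human-edited model outputs rather than purely
    human-written text, which the model may be unable to fit."""
    for _ in range(rounds):
        drafts = [(p, model(p)) for p in prompts]          # current model writes drafts
        edited = [(p, human_edit(d)) for p, d in drafts]   # humans edit the drafts
        model = finetune(model, edited)                    # fine-tune on the edits
    return model
```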

We also had another instruction-following model trained with RL that was released a little before ChatGPT, so I think if you put a chat-like wrapper on that, you would have gotten something decently close. But that model had some differences in strengths: it was pretty good at writing and poetry and so forth, but it wasn't as good at knowing its limitations, or at factuality, and so forth.

So stepping back from 3.5: I think I heard you say somewhere that with GPT-2 you were super impressed. Compared to your expectations in 2019, has AI progressed faster or slower than you would have expected?

I would say faster than I would have expected since GPT-2. I was pretty bought into scaling and pre-training and so forth being a good idea, but when GPT-2 was done, I would say I wasn't completely sold on it revolutionizing everything. I only really pivoted what I was working on, and what my team was working on, after GPT-3. After that we kind of got together and said, okay, this language model stuff works really well, let's see what we can do here. But after GPT-2, I wasn't quite sure yet.

Especially if the stuff we were talking about earlier with RL starts working better with these smarter models, will the fraction of compute that is spent on pre-training versus post-training change significantly in favor of post-training in the future?

Yeah, there are some arguments for that. I mean, right now it's a pretty lopsided ratio, but you could argue that the output generated by the model is higher quality than most of what's on the web, so it sort of makes more sense for the model to think by itself instead of just training to imitate what's on the web. So I think there's a first-principles argument for that, and I would say we've found a lot of gains through post-training. So I would expect us to keep pushing this methodology and probably increasing the amount of compute we put into it.

The current GPT-4 has an Elo score that is something like a hundred points higher than the original one that was released. Is that all because of these improvements that are brought on by post-training?

Yeah, I would say most of that is post-training.

Interesting.

There are a lot of different, separate axes for improvement. We think about data quality, data quantity, just doing more iterations of the whole process of deploying and collecting new data, and changing what kind of annotations you're collecting. There are a lot of things that stack up, but together they give you a pretty good effective compute increase.

Yeah, that's a huge increase. It's really interesting that there's this much room for improvement from post-training. What makes for somebody who's really good at doing this sort of RL research? I hear it's super finicky, but what are the sorts of intuitions that enable you to find these ways to mess with the data and to set up these environments?

I'd say I just have a decent amount of experience at this point with the different parts of the stack, from RL algorithms, obviously, since I've worked on those since grad school, to the data collection and the annotation process, to playing with language models. So I'd say I've just dabbled with all these things, and the people who do well at this kind of research have some view of the whole stack and a lot of curiosity about the different parts of it. You want to be empirical and let experiments update your views, but you also want to think from first principles somewhat: assuming that learning works, what would be the ideal type of data to collect? That sort of thing.

Because there doesn't seem to be a model released since GPT-4 that seems significantly better, there's this hypothesis that potentially we're hitting some sort of plateau, that these models aren't actually generalizing that well, and that you're going to hit some sort of data wall beyond which the abilities unlocked by memorizing a vast corpus of pre-training data won't actually get you something much smarter than GPT-4. What do you think? Is that hypothesis wrong? We talked about some examples of generalization generically, the Spanish to English case and so forth, but, and maybe this is a run-on question, one example I was thinking of was the idea that there's transfer from code to language reasoning: by training on a bunch of code, the model gets better at reasoning in language. Is that actually the case? Do you see things like that, which suggest that there's this sort of positive transfer between different modalities, so that once you train on a bunch of videos and images it will get smarter, and it'll get something out of different synthetic data? Or does it seem like the abilities that are unlocked are extremely local to the exact kind of labels and data you put into the training corpus?

Okay, I'll try to give a response to that. So first, are we about to hit the data wall? I wouldn't draw too much from the time since GPT-4 was released, because it takes a while to train these models and to do all the prep to train a new generation of models. So I wouldn't draw too much from that fact. I would say there are definitely some challenges from the limited amount of data, but I wouldn't expect us to immediately hit the data wall. I would expect the nature of pre-training to somewhat change over time as we get closer to it.

In terms of generalization from different types of pre-training data, I would say it's pretty hard to do science on this type of question, because you can't create that many pre-trained models. Maybe you can't train a GPT-4-sized model, so you can't do ablation studies at GPT-4 scale, but maybe you can train a ton of GPT-2-sized models, or maybe even a GPT-3-sized model, with different data blends and see what you get.

I'm not aware of any public results on ablations involving code data and reasoning performance and so forth, so I'd be very interested to know about those results.

I'm actually curious about that. If you ran the ablation on a GPT-2-level model and it suggested that there isn't that much transfer, how much evidence would that provide about the level of transfer, on a similar set of domains, for a GPT-4-level model?

Right, you might not be able to conclude that if transfer fails at GPT-2 scale, it's also going to fail at a higher scale. It might be that the larger models learn these better shared representations, whereas the smaller models have to lean too much on memorization, and the larger models can learn how to do the right computation. So I would expect this to be true to some extent.

This might have a very simple answer, but with bigger models, you train them on the same amount of data and they become smarter; or, conversely, to get the same amount of smarts, you have to train them on less data. Why is that the case? It's got more parameters, it's seen fewer things, and it's equally smart. Why is that the case?

I don't think anyone has a good explanation of the scaling law with parameter count.

I mean, I don't even know what the best mental model is for this. Clearly you have more capacity if you have a bigger model, so you should be able to eventually get a lower loss, but why are bigger models more sample efficient? I can give you a very sketchy explanation. You could say that the model is sort of an ensemble of a bunch of different circuits that do the computation. You could imagine it has a bunch of computations that it's doing in parallel, and the output is a weighted combination of them. If you just have more width of the model... and actually, width is somewhat similar to depth, because with residual networks the depth can do something similar to width in terms of updating what's in the residual stream. But you could argue that you're learning all these different computations in parallel, and you just have more of them with a bigger model.

So you have more chance that one of them is lucky, ends up guessing correctly a lot, and gets upweighted. There are some algorithms that work this way, like some kind of mixture model, or a multiplicative weights update algorithm, where you have basically a weighted combination of experts with some learned gating. I don't want to say "mixture of experts," because that means something different. Actually, I may have said something slightly wrong there, but anyway, you can imagine something like that: just having a bigger model gives you more chances to get the right function.

And then, of course, it's not just that you have a bunch of totally disjoint functions that you're taking a linear combination of. It's more like a library, where you might chain the functions together in some way, so there's some composability. So I would just say the bigger model has a bigger library of different computations, including lots of stuff that's kind of dormant and only being used some of the time, but it has more space to look for those circuits that do something useful.
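
(The multiplicative weights analogy John gestures at can be made concrete: keep a pool of simple predictors, shrink the weight of any predictor that guesses wrong, and a bigger pool gives more chances that some member matches the target and gets upweighted. A toy Hedge-style version, my own construction rather than a claim about what transformers literally do:)

```python
def multiplicative_weights(predictors, examples, eta=0.5):
    """Hedge-style update: each predictor keeps a weight that shrinks
    every time it predicts the wrong label."""
    weights = [1.0] * len(predictors)
    for x, y in examples:
        for i, predict in enumerate(predictors):
            if predict(x) != y:
                weights[i] *= (1.0 - eta)  # wrong: shrink this predictor's weight
    total = sum(weights)
    return [w / total for w in weights]

# A bigger "model" here is just a bigger pool of candidate circuits: more
# chances that one of them matches the target function and wins out.
def target(x):
    return x % 7 == 0

pool = [lambda x, k=k: x % k == 0 for k in range(2, 50)]
data = [(x, target(x)) for x in range(1000)]
weights = multiplicative_weights(pool, data)
best = max(range(len(pool)), key=lambda i: weights[i])
print("surviving circuit checks divisibility by", best + 2)  # prints 7
```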

I want to ask you about, stepping back from the current research questions, just the modal scenario of what happens over the next few years. Towards the beginning of the conversation we were talking about the case in which progress is really fast, but let's just take the modal scenario. You're unlocking long-horizon RL at some point, but then, as you said, there are potentially other bottlenecks. So what's happening? How good are these models? How are they being deployed? What other modalities are part of them, and at what stage are these being unlocked? I just want to understand your broader picture of what the next few years look like.

Yeah, I would expect new modalities to be added over time, or pretty soon. I would expect the capabilities to generally keep getting better through a combination of pre-training and post-training, and that'll open up new use cases. Right now, AI is still not a huge part of the economy; there's a pretty small fraction of jobs that it can help with at all. So I expect that to be higher over time, and not just from the models improving, but also from people figuring out how to integrate them into different processes. Even if we just froze the models at their current state, I think you would still see a lot of growth in how they're being used.

So I'd expect AI to be used much more widely, and I would expect it to be used for more technically sophisticated tasks, like the programming example I gave earlier of doing longer projects, but also helping with various kinds of research. I hope that we can use AI to accelerate science in various ways, because you can potentially have the models understand all of the literature in a given field and be able to sift through tons of data, more than a person would have the patience to do. I hope the form factor would basically be that people are still driving all this, and you have your helpful assistants that you can direct and point at lots of different problems that are useful to you, and everyone has all these AIs helping them get more done.

lots of different problems that are useful to you and everyone sort of has all these uh AI's helping them uh helping them do more get more done hey everybody real quick I want to tell you about a tool that I wish more applications used so obviously you've noticed every single company is trying

Hey everybody, real quick, I want to tell you about a tool that I wish more applications used. You've obviously noticed that every company is trying to add an AI chatbot to their website, but as a user I usually find them really annoying because they give these long, generic, often useless answers. CommandBar is a user assistant that you can embed into your website or application, and it feels like you're talking to a friendly human support agent who's browsing with you and for you. It's much more personalized than a regular chatbot: it can actually look up a user's history and respond differently based on that, it can use APIs to perform actions, and it can even proactively nudge users to explore new features. One thing I think is really cool is that instead of just outputting text, CommandBar can say "here, let me show you" and start browsing alongside the user. It's already in a bunch of great products; you can learn more at commandbar.com. Thanks to them for sponsoring this episode.

But obviously at some point they're going to be better than everyone at whatever they want to do. Right now they're clearly only helping you; at some point they'll be able to just do things for you, and maybe run entire firms for you. Is that just going to be a smooth process, where at that point the hope is that we have systems aligned with the user enough that they can count on the firm being run the way they expect, and so forth?

Yeah, I think we might not want to jump to having AIs run whole firms immediately. We might want to have people overseeing these important decisions and calling the shots, even if the models are good enough to actually run a successful business themselves. So to some extent there might be choices there, and I think people will still have different interests and different ideas for what kinds of pursuits they want to direct their AIs at. The AI doesn't necessarily have any kind of intrinsic desire unless we put it in the system. So even if AIs become extremely capable, I would hope that people are still the drivers of what the AIs end up doing.

Yeah, but I wonder if the economic equilibrium is so far from that. You have the equivalent of Amdahl's law in a firm: the slowest part of the process is the one that bottlenecks you. So if the AI makes all the non-human parts of the firm 10x more efficient, the firm is still bottlenecked by that human step. If one company decides to proceed by keeping humans in the loop on all the things you really want human oversight on, it'll just be outcompeted by other companies, and if one country decides to go this route, other countries will outcompete it. So I wonder if this is a sustainable plan for keeping humans in the loop.

Right. I think if we wanted to keep humans in the loop, which seems reasonable, and it turned out that firms with any humans in the loop were outcompeted by firms that didn't have any humans, then you would obviously need some kind of regulation that disallowed having no humans in the loop for running a whole company.

But there are so many companies, in any country, let alone the world. I wonder if it's better to do the regulation on companies and say you've got to keep humans in the loop on important processes, but then you have to define what the important processes are, you've got to monitor every single company, and you also have to get cooperation from every single country that has firms. Versus: maybe this is a problem that should have been solved before the model is even deployed, such that if you did decide to build a firm on these models, it basically does what you want it to do and you don't need a human in the loop. Does that question make sense? I guess I'm wondering, in this situation, how do we actually monitor whether every single firm has a human in the loop, and what happens if, say, China decides not to do that, and so forth?

Yeah, you would either have to have every country agree to this regulatory regime, or you would need all of the model infrastructure, the model providers, to agree to this kind of requirement. So it's definitely going to be non-trivial. This is looking a ways ahead, so it's a little hard to imagine this world before seeing anything like it. But for example, there are some questions like: are we actually confident that AI-run companies are better in every way, or do we think they're better most of the time but occasionally malfunction? AIs are still less sample efficient in certain ways, like dealing with very wacky situations, so AI-run firms might actually have higher tail risk because they're more likely to malfunction in a big way. There might be practical questions like that which also determine how things play out. And maybe if you just require people to be accountable, with various kinds of liability, this would also change the incentives a bit. If it turned out that AIs are better at running everything, and they're also completely benevolent, and we've totally solved alignment, and they're better at being accountable to people than people are, then I would say maybe it's okay having the AIs run the firms. But that might be pretty far out, and I think we're more likely to be in a situation where they look better in the short term but the AI-run entities still have some serious problems, and it's actually practical considerations that push you more towards having humans in the loop, at least in the near future.

Okay, so this is a problem you have to deal with today with RLHF, where you have to aggregate preferences across a lot of different humans, and it'll maybe be more marked with future, more powerful systems. But when you say we want these eventual AI systems, which are going to fully replace humans as part of these firms, to be aligned, what does that mean? Does it mean they basically do what the user wants them to do, or does it mean they have to result in some sort of global outcome that we're happy with as the stakeholders in OpenAI? What concretely would that mean?

If the models are being used for these higher-stakes use cases, then we would have to think about RLHF in a much different way than we do right now. I would say we're not quite ready for that, or the current methods might not be completely sufficient, but we would need to make compromises between the needs of the different stakeholders involved. We have this document that we're releasing called the Model Spec, about how we want our models to behave in the API and in ChatGPT, and we try to talk about this issue where there are different stakeholders involved and sometimes there are conflicts between what they might want. In our case we were thinking of the stakeholders as the user, or the end user, meaning someone sitting in front of ChatGPT or some other app; the developer, someone using the API who might be serving other end users with their app; the platform, which is OpenAI, since we don't want the models to expose us to legal risk and so forth; and then the rest of humanity, including people who might not be users or customers at all. Obviously the user might ask the model to do something that we think is actively harmful to other people, and we might have to refuse that. By the way, this isn't necessarily the order of priority; we just have these four or so classes of stakeholder. Actually, you could also say maybe in the future we'll add the model itself, but I would say we're not going there yet. Anyway, we have these different stakeholders, sometimes they have conflicting demands, and we have to make some call on how to resolve those conflicts, and it's not always obvious how to do that. So we had to think through the trade-offs, and basically the rough heuristic is that we mostly want the models to follow your instructions and be helpful to the user and the developer, but when this impinges on other people's happiness or way of life, that becomes a problem and we have to block certain kinds of usage. We don't want to be too paternalistic; we mostly want the models to be an extension of people's will and do what they say. We want to be kind of neutral and not impose our opinions on people, and mostly let people do what they want with the models.

I got a chance to read the spec beforehand, and, while I guess there's a question of how well it transfers over to how the model itself behaves, I was impressed with how sensible the trade-offs were. It explicitly states the actual edge cases rather than the kinds of things everybody can agree are obvious; you really are going after the edge cases.

Yeah, we wanted it to be very actionable, so that it wasn't just a bunch of nice-sounding principles. Each example tells you something about some non-obvious situation and reasons through that situation.

Okay, now I have a couple of questions about the state of the research itself. Famously, in the social sciences things are really hard to replicate, and there's a question about how much of the science there is real versus manufactured, bespoke sorts of experiments. When you look at the average ML paper, does it feel like a really solid piece of literature, or does it often feel like the equivalent of what p-hacking is in the social sciences?

Everyone has their complaints about the ML literature, but I would say overall it's a relatively healthy field compared to some others, like the social sciences, just because it's largely grounded in practicality and getting things to work. If you publish something that can't be replicated easily, people will just forget about it. And it's accepted that you often don't just report someone's number from their paper; you also try to re-implement their method and compare it to your method on the same training data set. So if you publish methods that are really hard to implement or really finicky, they'll tend to get forgotten, and as a result people actually try to open source their work a lot. There are also various unfavorable incentives: people are incentivized to make the baseline methods they're comparing against worse, and there are other mild pathologies, like trying to make your method seem mathematically sophisticated. But I would say overall the field makes progress. I would probably like to see a little bit more science and trying to understand things, rather than hill climbing on benchmarks and proposing new methods. There's been a decent amount of that recently, but we could use more of it, and I think that's a good thing for academics to work on.

Oh yeah, and on the social sciences, on a slightly different note, I'd actually be really excited to see more research using base models to do simulated social science. These models have a probabilistic model of the whole world, and you can set up a simulated questionnaire or a conversation and look at how anything is correlated: any traits you might imagine, you can see how they're correlated with other traits. So it would be pretty cool to see if people could replicate some of the more notable results in the social sciences, like Moral Foundations and that sort of thing, just by prompting base models in different ways and seeing what's correlated.

What is that standard experiment? The Asch conformity test, right?

Yeah, maybe see whether that replicates with language models as well. That'd be interesting.
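As a concrete illustration of the kind of probing John describes, here is a rough sketch of scoring a simulated questionnaire with an open base model and comparing agreement across simulated respondents. GPT-2 from Hugging Face is used purely as a stand-in base model, and the personas, items, and prompt format are invented for the example; this is not anyone's actual experimental setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def continuation_logprob(prompt: str, continuation: str) -> float:
    """Total log-probability the model assigns to `continuation` after `prompt`."""
    ids = tok(prompt + continuation, return_tensors="pt").input_ids
    n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logprobs = torch.log_softmax(model(ids).logits, dim=-1)
    # The token at position i is predicted by the logits at position i - 1.
    return sum(logprobs[0, i - 1, ids[0, i]].item() for i in range(n_prompt, ids.shape[1]))

def agreement(persona: str, statement: str) -> float:
    """Higher means the simulated respondent is more likely to agree."""
    prompt = f"{persona}\nStatement: {statement}\nThe respondent says: I"
    return continuation_logprob(prompt, " agree") - continuation_logprob(prompt, " disagree")

personas = ["A 23-year-old urban survey respondent.", "A 70-year-old rural survey respondent."]
items = ["New experiences matter more than tradition.", "Authority should generally be respected."]
# With many simulated respondents, you would correlate these scores across items.
print([[round(agreement(p, s), 2) for s in items] for p in personas])
```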

With the rest of the research that happens at big labs, how much of it is decreasing the amount of compute you need to get a certain result, an actual compute multiplier, versus how much of it is things that just make the learning more stable and build out the infrastructure? I guess the broader question is: since GPT-4, does it feel like with the same amount of compute you can train a much better model? Or does it feel like we've made sure the learning can happen better and in a more scalable way for GPT-5, but it's not like we could now train GPT-4 with a GPT-3.5 budget, or something like that?

Well, there's definitely always progress in improving the efficiency. Whenever you have a 1D performance metric, you're going to find that different improvements can substitute for each other. You might find that post-training and pre-training both improve the metrics, or that they have slightly different profiles of which metrics they improve, but if at the end of the day you have a single number, they're going to substitute for each other somewhat. So for something like a human evaluation, what humans prefer, we've definitely made a lot of progress on both sides, pre-training and post-training.

A couple of rapid-fire questions about RLHF. Obviously RLHF is important to make these models useful, so maybe the lobotomized description is inaccurate, but there is a sense in which all of these models, once they're put in a chat platform, have a very similar way of speaking. They really want to delve into things, they want to turn things into bullet points, they often have this formal and dull way of speaking, and there are complaints that they're not as creative, like what we were talking about before, where they could only do rhyming poetry and not non-rhyming poetry until recently. Is that a result of the particular way RLHF happens now? And if so, is it because of who the raters are, or because of the loss function? Why do all chatbots sound this way?

I would say there's a decent amount of room for variation in exactly how you do the training process. We're actively trying to improve this and make the writing more lively and more fun, and I think we've made some progress improving the personality of ChatGPT, so it's more fun, it's better when you're trying to chitchat with it, it's less robotic. It's a kind of interesting question how some of the tics came about, like the word "delve." I've actually caught myself using it a bit recently, though I don't know if it rubbed off on me from the model or what. I think there might also be some funny effects going on where there's unintentional distillation happening between the language model providers: if you hire someone to do a labeling task, they might just pull up their favorite chatbot, feed the task in, have the model do it, and then copy and paste the result back. That might account for some of the convergence. But I also think some of the things we're seeing are just what people like. People do like bullet points, they like structured responses, and people often like the big info dumps they get from the models.

So it's not completely clear how much of it is just a quirk of the particular choices and design of the post-training process, and how much is actually intrinsic to what people want.

It is persistently more verbose than some people want. Maybe that's just because during the labeling stage the raters prefer the more verbose answer, but I wonder if it's inherent to how it was pre-trained, where the stop sequence doesn't come up that often and it really wants to just keep going.

There might be some biases in the labeling that lead to verbosity, like the fact that we tend to train on one message at a time rather than the full interaction. If you only see one message, then something that just asks a clarifying question, or a short response with an invitation to follow up, is going to look less complete than something that covers all possibilities. There's also a question of whether people's preferences would change depending on how fast the model is streaming its output. Clearly, if you're sitting there waiting for the tokens to come out, you're going to prefer that it gets to the point.

But if it just gives you a dump of text instantly, maybe you don't actually care if there's a bunch of boilerplate, or if there's a bunch of stuff you can skim; you'd rather just have it all there.

Yeah. The reward model is, I think, such an interesting artifact, because it's the closest thing we have to an aggregation of what people want, what preferences they have. When you think about models that are much smarter, one hope would be that you could just give them a list of the things we want, the non-trivial and non-obvious kind, like a UN declaration of rights. On the other hand, I think I've heard you make the point that a lot of our preferences and values are very subtle, so they might be best represented through these pairwise preferences. When you think of a GPT-6 or GPT-7 level model, are we giving it more of a written set of instructions, or are we still capturing these sorts of subtle preferences through comparisons?

That's a good question. These preference models do learn a lot of subtleties about what people prefer that would be hard to articulate in an instruction manual. Obviously you can write an instruction manual that has lots of examples of comparisons; that's what the Model Spec has, a lot of examples with some explanation. So it's not clear what the optimal format is for describing preferences. I would guess that whatever you can get out of a big data set that captures fuzzy preferences, you can distill down to a shorter document that mostly captures the ideas. And I would think the bigger models learn a lot of these concepts automatically: they'll just learn from all the pre-training data what people find useful and helpful, and they'll have some complex moral theories they can draw on. But of course there's still a lot of room to latch onto a different style or a different morality. So when we write a doc, or when we align these models, what we're doing is latching onto a specific style, a specific morality, and you still need a decently long document to capture exactly what you want.
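For readers unfamiliar with how pairwise comparisons actually get turned into a reward model, here is a minimal sketch of the standard Bradley-Terry-style preference loss described in the RLHF literature. The tiny linear network and random features are toy stand-ins; this illustrates the general recipe, not how any particular production model is trained.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_model, chosen, rejected):
    """Pairwise (Bradley-Terry style) loss for fitting a reward model.

    `reward_model` maps a batch of encoded (prompt + response) examples to one
    scalar score each; `chosen` are the rater-preferred responses, `rejected`
    the dispreferred ones. Minimizing the loss pushes chosen scores above
    rejected scores.
    """
    margin = reward_model(chosen) - reward_model(rejected)  # shape: (batch,)
    return -F.logsigmoid(margin).mean()

# Toy stand-in: a linear "reward model" over made-up 16-d feature vectors.
torch.manual_seed(0)
reward_model = torch.nn.Sequential(torch.nn.Linear(16, 1), torch.nn.Flatten(0))
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-2)
chosen = torch.randn(64, 16) + 0.5   # "preferred" examples, shifted features
rejected = torch.randn(64, 16)
for step in range(200):
    opt.zero_grad()
    loss = preference_loss(reward_model, chosen, rejected)
    loss.backward()
    opt.step()
print(round(loss.item(), 4))  # falls as the model separates the two sets
```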

How much of a moat is better post-training? Currently companies distinguish themselves by how big their models are and so forth. Will it be a big moat to have figured out all the finickiness you were talking about earlier with regard to all this data?

I think there's something of a moat, because it's just a very complex operation. You have to have a lot of skilled people doing it, so there's a lot of tacit knowledge, and a lot of organizational knowledge is required. Post-training a model so that it actually has all the functionality people care about is pretty complicated; it requires a pretty complicated effort, and it's basically an accumulation of a lot of R&D. So I would say that makes it somewhat of a moat: it's not trivial to spin this up immediately. It does seem like the same companies that are putting together the most serious pre-training efforts are also putting together the serious post-training efforts, but it does seem somewhat possible to copy, or to spin up more of these efforts. One force that makes it less of a moat is that you can distill the models: you can take someone else's model and clone the outputs, or you can use someone else's model as a judge to do comparisons. The bigger players probably aren't doing that, because it goes against terms of service policies and it would also be a bit of a hit to their pride, but I would expect some of the smaller players are doing it to get off the ground, and that catches you up to a large extent.
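To make the "model as a judge" idea concrete, here is a bare-bones outline of pairwise comparison with a judge model. The `call_judge_model` function is a hypothetical placeholder for whatever provider API you would plug in, not a real library call, and the prompt template is invented for the example; the randomized A/B ordering is there to blunt the well-known position bias of judge models.

```python
import random

JUDGE_TEMPLATE = """You are grading two answers to the same question.
Question: {question}
Answer A: {a}
Answer B: {b}
Reply with exactly one letter, A or B, for the better answer."""

def call_judge_model(prompt: str) -> str:
    # Placeholder: swap in the chat-completion call for your model provider.
    raise NotImplementedError("plug in your model provider's API here")

def judge_pair(question: str, answer_1: str, answer_2: str) -> int:
    """Return 0 if answer_1 wins, 1 if answer_2 wins.

    The A/B order is randomized so the judge's position bias does not
    systematically favor one system over the other.
    """
    flipped = random.random() < 0.5
    a, b = (answer_2, answer_1) if flipped else (answer_1, answer_2)
    verdict = call_judge_model(
        JUDGE_TEMPLATE.format(question=question, a=a, b=b)
    ).strip().upper()
    winner_is_a = verdict.startswith("A")
    return (1 if winner_is_a else 0) if flipped else (0 if winner_is_a else 1)
```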

Related to the moat: what is the median rater like? Where are they based, what are their politics, what is their knowledge level?

I would say it varies a lot. We've definitely hired raters with different skills for different kinds of tasks or projects. A decent mental model is to look at who's on Upwork and other platforms like that, people doing odd jobs with remote work. It's a pretty international group, though there's a decent number of people in the US. We hire different groups of people for different types of labeling, depending on whether we're more focused on writing or on STEM tasks. People doing STEM tasks are more likely to be in India or other middle or lower-middle income countries, whereas people doing more English writing and composition tend to be more US-based. And there have been times when we needed to hire different experts for some of our campaigns. Some of them are very talented; we even find that they're at least as good as us, the researchers, at doing these tasks, and they're much more careful than us. So I would say the people we have now are quite skilled and conscientious.

With regards to the plateau narrative, one of the things I've heard is that a lot of the ability these models have to help you with specific things comes from having very closely matched labels in the supervised fine-tuning data set. Is that true? If it can teach me how to use ffmpeg correctly, is that because somebody looked at the inputs, figured out which flags you need to add, and wrote that down? Do you need to hire labelers with domain expertise in all these different domains? Because if that's the case, it seems like it would be a much bigger slog to get these models smarter and smarter over time.

Right, you don't exactly need that, because you can get quite a bit out of generalization. The base model has already been trained on tons of documentation and tons of code with shell scripts and so forth, so it's already seen all the ffmpeg man pages and lots of bash scripts and everything. Even just giving the base model a good few-shot prompt, you can get it to answer queries like this. And training a preference model for helpfulness, probably even if you don't train it on any STEM, will somewhat generalize to STEM. So not only do you not need examples of how to use ffmpeg, you might not even need anything with programming to get some reasonable behavior in the programming domain.
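Here is a rough sketch of the "good few-shot prompt to the base model" point: a raw base model, never instruction-tuned, will continue a pattern of question-answer pairs well enough to attempt new queries. GPT-2 is only a stand-in for a base model (a model this small gives weak answers), and the example Q&A pairs in the prompt were written for illustration rather than taken from any real training data.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# A few hand-written Q&A pairs establish the pattern; the base model then
# continues it for the final, unanswered question.
FEW_SHOT = """Q: How do I extract a .tar.gz archive on Linux?
A: Run `tar -xzf archive.tar.gz`.

Q: How do I search every file under a directory for a string?
A: Run `grep -r "the string" path/to/dir`.

Q: How do I convert a video to a GIF with ffmpeg?
A:"""

ids = tok(FEW_SHOT, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=40, do_sample=False,
                     pad_token_id=tok.eos_token_id)
print(tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True))
```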

Maybe a final question. We've touched on this in different ways, but to put it together: you say you're training on much more multimodal data, so presumably these things will understand what screens look like and be able to interact with them in a much more coherent way, and you're also going to do this long-horizon RL, so they'll be able to act as agents and assistants that can be part of your workflow in a much more integrated way. What do you expect that to look like, and what will the next steps be from there? Suppose by the end of this year or next year you have something like an assistant that can work with you on your screen. First of all, does that seem like a sensible thing to expect, and then where does it go from there?

I would definitely expect things to move in that direction. It's unclear what's going to be the best form factor, whether it's something like a Clippy that's on your computer helping you, or more like a helpful colleague in the cloud. We'll see which kinds of form factors work the best, and I would expect people to try all of them out. I would expect the mental model of a helpful assistant or helpful colleague to become more real, where you can share more of your everyday work with it. Instead of just giving it one-off queries, you would have a whole project that you're doing, and it knows about everything you've done on that project so far. It can even proactively make suggestions; maybe you can tell it, "remember to ask me about this and whether I've made any progress on it." I think proactivity is one thing that's been missing. I'd really love to see us move away from one-off queries, using the model kind of like a search engine, and more towards having a whole project that I'm doing in collaboration with the model, where it knows everything I've done, it's proactively suggesting things for me to try, or it's going and doing work in the background.

That's really interesting. By the way, one final question: what's your median timeline for AI that replaces your job?

Replaces my job? Maybe five years.

Pretty soon. Interesting. Okay, well, John, this was super interesting. Thanks so much for making the time. This seems like one of the parts of the AI process that is super important and that people don't understand that much about, so it was super interesting to delve into it.

Yeah, thanks for having me on the podcast. It was fun to talk about all this stuff.

Hey everybody, I hope you enjoyed that episode with John. He's just a very thoughtful guy, and it's super interesting to learn about the way these models become the kind of shoggoths that they are. Anyways, as you can see, I'm now doing ads on the podcast, so if you'd like to advertise, you can reach out at the link in the description. And of course, if you enjoyed the episode, it's really helpful if you can share it with other people who you think might enjoy it: your friends, group chats, Twitter, whatever else. See you on the next one. Cheers.
