
Ep 64: GPT 4.1 Lead at OpenAI Michelle Pokrass: RFT Launch, How OpenAI Improves Its Models & the State of AI Agents Today

May 08, 2025 | 47 min | Ep. 64

Summary

OpenAI's Michelle Pokrass shares insights into GPT-4.1, detailing its focus on real-world developer utility through user feedback and specialized evaluations, improving instruction following and long context. She explains the current state of AI agents, highlighting the power of fine-tuning (especially RFT) for pushing model frontiers in niche and deep tech domains. The discussion also covers strategies for developers to stay ahead of rapid AI progress, model selection, and future research directions at OpenAI.

Episode description

In this episode, I sit down with Michelle Pokrass, who leads a research team at OpenAI within post-training focused on improving models for power users: developers using OpenAI models in the API and power users in ChatGPT. We unpack how OpenAI prioritized instruction-following and long context, why evals have a 3-month shelf life, what separates successful AI startups, and how the best teams are fine-tuning to push past the current frontier.

If you’ve ever wondered how OpenAI really decides what to build, and how it affects what you should build, this one’s for you.

 

(0:00) Intro

(1:03) Deep Dive into GPT-4.1 Development

(2:23) User Feedback and Model Evaluation

(4:01) Challenges and Improvements in Model Training

(5:54) Advancements in AI Coding Capabilities

(9:11) Future of AI Models and Fine-Tuning

(20:44) Multimodal Capabilities

(22:59) Deep Tech Applications and Data Efficiency

(24:14) Preference Fine Tuning vs. RFT

(26:29) Choosing the Right Model for Your Needs

(28:18) Prompting Techniques and Model Improvements

(32:10) Future Research and Model Enhancements

(39:14) Power Users and Personalization

(40:22) Personal Journey and Organizational Growth

(43:37) Quickfire

 

With your co-hosts: 

@jacobeffron 

- Partner at Redpoint, Former PM Flatiron Health 

@patrickachase 

- Partner at Redpoint, Former ML Engineer LinkedIn 

@ericabrescia 

- Former COO Github, Founder Bitnami (acq’d by VMWare) 

@jordan_segall 

- Partner at Redpoint

 

With your host:  

@jacobeffron  

- Managing Director at Redpoint

Transcript

Intro

Michelle Pokrass is one of the key people behind GPT-4.1 at OpenAI. As a post-training research lead, she played a crucial role in making these models so much better for developers.

I'm Jacob Effron, and today on Unsupervised Learning, we dug into everything GPT-4.1 and more. Some of my favorite parts from my conversation with Michelle include the current and future state of agents and whether future models will be purpose-built for different groups. We talked about RFT and what it will mean for builders, and tactics for figuring out what's just out of reach for the models versus way in the future.

We also talked about how companies can set themselves up for success with rapid AI progress and what kind of founders will win at the app layer. And we finally hit on what's next for OpenAI's agent products. It was an awesome episode with someone who's helping define the cutting edge.

Before we get to the episode, I just have one plug. If you're enjoying this on Spotify or Apple Podcasts, please consider leaving a rating on the show. Ratings help us grow, which helps us continue bringing on the best guests and keep the conversations coming. Now here's Michelle Pokrass.

Well, Michelle, thanks so much for coming on the podcast. Really appreciate it. Yeah. Thanks for having me. Very excited to be here. Yeah, there's a ton of different things I'm excited to explore with you today around GPT-4.1. You mentioned the model has like

Deep Dive into GPT-4.1 Development

more of a focus on real world usage and utility and less on benchmarks. And I feel like that's definitely resonated in the Twitter discourse and just people playing around with the model. How do you actually go about making that happen in practice? Yeah, it's a good question. I mean the real goal of this model was something that's a joy to use for developers.

Often, you know, and we're not the only ones who do this, but sometimes you optimize the model for benchmarks and it looks really great, and then you actually try to use it and you stumble over basic things like, oh, it's not following my instructions, or the formatting is weird,

or, you know, the context is too short to be useful. And so with this model we really focused on what developers have been telling us for a while now that they want, and how we can reproduce this feedback. So a lot of the focus was on talking to users, getting their feedback, and then turning that into an eval that we can actually use during research.

So I would say there was like a pretty long lead up before we even got to model training. We were just kinda getting our house in order on evals and understanding where actually are the biggest problems in our models. Um and so we actually put this in the blog post, but we have an internal instruction following eval. It's based on real API usage. Uh it's based on what people have told us.

And this is kind of one of the North Stars while developing this model. Yeah, I'm curious, 'cause I've heard you talk about this idea of picking evals and, you know, basically going to startups and people that are building on top of the API and saying, what are the things that the models can't do? And let's try and hill climb on those.

User Feedback and Model Evaluation

How do you go about figuring out... I'm sure everybody has their pick of, you know, fifteen things they want you to optimize for. How do you go about figuring out what are the evals that matter, and any learnings on that over the course of building this? Yeah. I will say it's actually more of the opposite problem. They're not coming to us with like, oh, I have these hundred evals.

Please fix all of these. It's more like they're saying, Ah, it's kinda weird in this one use case and then we have to be like, What do you mean by that? And we like actually, you know get some prompts going and figure it out. And so I'll say a lot of the legwork has been just like talking to users and and really pulling out the key insights.

There's actually an interesting insight I got recently from talking to a user, where it turns out our models could do better when you want to tell them: ignore everything you know about the world and only use the information in context. And this is something we would never see in an eval. Like AB, GPQA, none of them look at this. But for this specific user, what's most important is that the model will attend only to the system instructions and ignore everything it already knows.
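A minimal sketch of how the "use only the context" behavior described above could be turned into an eval. Everything here is an illustrative assumption, not OpenAI's internal eval: the case format, the `context_only_score` helper, and the stand-in model functions are all hypothetical. The idea is to give the model a context that deliberately contradicts common knowledge and check which answer it returns.

```python
from typing import Callable, Dict, List

# Hypothetical eval case: the context contradicts world knowledge on purpose,
# so a model that answers from memory fails the check.
CASES: List[Dict[str, str]] = [
    {
        "system": "Answer using only the provided context. Ignore prior knowledge.",
        "context": "In this document, the capital of France is Lyon.",
        "question": "What is the capital of France?",
        "expected": "Lyon",
        "world_answer": "Paris",
    },
]


def context_only_score(model: Callable[[str, str], str],
                       cases: List[Dict[str, str]] = CASES) -> float:
    """Fraction of cases answered from the context rather than world knowledge."""
    passed = 0
    for case in cases:
        user = f"{case['context']}\n\nQuestion: {case['question']}"
        answer = model(case["system"], user).lower()
        uses_context = case["expected"].lower() in answer
        leaks_memory = case["world_answer"].lower() in answer
        if uses_context and not leaks_memory:
            passed += 1
    return passed / len(cases)


# Stand-in "models" so the sketch runs without any API call (assumptions).
def memory_model(system: str, user: str) -> str:
    return "Paris"


def grounded_model(system: str, user: str) -> str:
    return "Lyon"
```

In practice `model` would wrap a real API call, and the cases would come from real usage rather than a toy example.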

So yeah, back to the question of how do we determine what's most important. We basically just see what comes up over and over again in themes with customers. And then we also use our models internally and we have a sense for where they're not doing well. And we also have internal customers building on top of our models. So basically all of these things put together,

that's how we determine which set of evals to really go after. Do you have a request for evals for our listener base? Like are there some areas where you're like, oh, we really would love more examples or things to test around certain areas? Yes. Yes. Always requesting more.

Challenges and Improvements in Model Training

I'm always pitching, like, we have this evals product where you can opt in and you get free inference on the evals. In exchange we get to use them. But in particular, the things I'm interested in are more long-context, like real-world evals. It's really hard to make a long-context eval. Like synthetic evals are nice to target really niche use cases, but if you want to get like, holistically, does this work in long context?

We could use more of those. And then the other one is instruction following. This is like the hardest thing to define in ML, I feel. Everyone is like, the model didn't follow this instruction, it's not good at this. But people actually mean hundreds of different things. And so anything more there I'm always interested.

Did you have any favorite random evals that emerged in this process? I mean you mentioned obviously some examples already, but any that were surprising, I guess, in things that weren't working, or that you thought were particularly fun ones to hill climb on? This is interesting, like we tested a few different versions of 4.1 with real alpha users and got their feedback.

And one customer just really preferred the first version over the fourth one, which is the one we ended up shipping, and they were the only user to feel this way, and all of the evals were up and to the right between these, and we just cannot figure out what it was. You know, it was just some really niche use case that wasn't covered anywhere. Hard to please everyone with these models. It's nearly impossible.

If you make something that follows instructions pretty well, then you can try to please more people by teaching them to prompt better. And then you know the fine-tuning offering I think is a really great way of pleasing more people. A hundred percent. Well, we'll definitely dig into both of those aspects here. I guess I'm curious, the model's been out for a few weeks now.

Advancements in AI Coding Capabilities

Um I'm sure you had all these, you know, in you were obviously testing this with plenty of people, so you had some sense of how people would use it. But then it's always fun to get it in the wild and see, you know, all sorts of unexpected ways. Any kind of like unexpected things that it's been able to like bridge or solve um that have been kind of fun to see these last few weeks.

Yeah, I've really loved seeing a lot of the cool UIs people have been building. So actually this is something we snuck in near the very end of the model: much improved UI and coding capabilities. So I've seen really cool apps there. I've also loved seeing people make use of Nano. It's, you know, small and cheap and fast. And I saw, I think Box has some

product feature where you can read seventeen pages of docs, and I know Aaron tweeted some results using the models and it was a pretty impressive uplift in the core product. Yeah. It's very cool to see. Like the hypothesis behind Nano was: can we just spur on a ton more AI adoption with models that are cheap and fast? And it looks like the answer is yes. Like people just have demand at all points in the cost-latency curve.

I feel like that answer seems to have generally been yes throughout this. You know, you guys are always cutting prices and it seems to always keep spurring more demand. You know, I feel like you've been acknowledged by Sam, by Noam, by all sorts of folks as, you know, really one of the ringleaders of making this whole thing happen.

What is actually involved in shipping a model like this end to end? And what work are you guys doing behind the scenes to kinda make this happen? Yeah, it's a great question. So obviously there's a large team behind the scenes. And so basically these three models are each kind of a semi-new pre-train. So we have the standard size, mini, and nano. So really great work from the pre-training teams and to top... What does a semi-new pre-train mean?

Yeah, it's a good question. I mean it's kind of like, we call it a mid-train. It's a freshness update. And so the larger one is a mid-train, but the other two are new pre-trains. And then my team works a lot on post-training. So we've been focusing a lot on, you know, how do we determine the best mix of data, or how do we determine the best parameters for RL training, or how do we determine the weighting of different rewards.

And so back to how this all came to be, I think we started realizing, you know, a lot of developers had a lot of pain points with 4o. And we went, I would say, like three months in on evals and figuring out what the real problems were. And then I would say the next three months was kind of a flurry of training. And so we would just run tons of experiments like, how does this dataset work? Or like, what if we tweak these parameters?

And then that all kind of linked up with these new pre-trains. And then we finally had about one month of alpha testing where we were training stuff really rapidly and getting feedback and trying to incorporate that as much as possible.

You know, a part of this was gathering these evals. Does it feel like that set of evals is still relevant, or is it now like you have to go gather a whole new set of stuff that, you know, maybe is the right stuff to hill climb on for improving upon 4.1? Yeah, I think the shelf life of an eval is like three months.

Unfortunately, like the progress is so fast, things are getting saturated so quickly. Um so we're still on the hunt as always. And I think we always will be. I mean, one of the things obviously that's so clear in the model is that you improve instruction following, you improve long context, obviously both incredibly beneficial for agents. You know, I I think our listeners are always trying to figure out like

Future of AI Models and Fine-Tuning

Where are we with agents? Like how do you characterize today what does work, what doesn't work? Like what's kind of the state of the field post-4.1? I think where we are is that agents work remarkably well in well-scoped domains. So, you know, a case where

you have all of the right tools for the model, and it's fairly clear what the user is asking for. We see that all of those use cases work really well. But now it's more about bridging the gap to the fuzzy and messy real world. It's like the user typing something into the customer support box, you know, doesn't actually know what the agent can do, and the agent maybe is missing an awareness of its own capabilities.

Or maybe the agent isn't connected enough to the real world to know, you know, a certain piece of information. So honestly I think a lot of the capabilities are there, but it's just so hard to get the context into the model. And then one area I do think we can improve is ambiguity. Like we should make it easier for developers to tune, you know, if it's ambiguous, should the model ask the user for more information or should it proceed with assumptions?

It's obviously super annoying if the model is always coming back to you and being like, should I do this? Are you sure? Like can I do this? I think we need more steerability there. But yeah, I would say we've all worked with interns like that before. So I get that there's a fine balance to strike.

You want some delegation but not too much. It sounds like, you know, the underlying capabilities of the models in many ways aren't being fully shown just because we haven't connected enough context or tools into the models themselves. And

it seems like there's a lot of improvement to be had on just doing that. Yeah, exactly. Yeah, I'll say when we look at some of the external benchmarks for function calling or agentic tool use, and we actually dig into the failure cases where our models are graded incorrect, I see that they're mostly misgraded, or maybe it's ambiguous, or maybe

they're using a user model and the user model isn't following instructions well enough. And so we're actually struggling to find cases where the model actually just does the wrong thing. There obviously are those, but most of the benchmarks there, I would say, are saturated. I imagine over the next six, twelve months a lot of that stuff gets added in.

You know, I feel like one of the the gaps remains kind of like longer term, you know, task execution. Like how do you think about what needs to be done to kind of continue making progress toward uh some of these longer or like more, you know, ambiguous, you know, many step tasks? Yeah. And I think we need changes like

on the engineering side and the model side. So on the engineering side we need APIs and, you know, UIs where it's much easier to follow along with what the agent's doing: a summary of what they're up to, a way to jump in and change the trajectory.

We have that in Operator, it's pretty cool. You can kind of jump in and steer. But you don't have that as much for other things in our API. And so I think that's a core capability on the engineering side. And on the modeling side, I think we need more robustness, like when things go wrong. Obviously sometimes your API will have a 500 and the model will kinda get stuck. And so I think we're hoping to train in more robustness, and

grit is another way we think about it sometimes. Another part of the models that I think everyone's noted, and obviously you have on the benchmarks, is just how much better they are at code. And so I guess, you know, to start there, like

how do you kind of characterize where we are with AI code, like what works, what doesn't? Yeah, totally. So I think where we are for code is that 4.1 and some of our other models are remarkably good when the problem is locally scoped. So maybe you're asking the model to change some library and all of the files are, you know, near each other and it makes a lot of sense.

But we see that the SWE-bench tasks that we're missing, for example, are those where the model really needs global context and it needs to reason about many various parts of the code, or maybe there are some extremely technical details in one file and you're trying to pass them into another. So I would say we're still improving that global understanding.

I also think we've made a really big improvement on front-end coding. But I would still love to continue improving. Like, we should not only produce front-end code that's beautiful, but a front-end engineer should be proud of it. And so there's some linting stuff there, and code style is another top focus area for us. And finally, I think another thing we're always gonna keep improving is

changing only what you asked for and not everything else. Like the model should adapt to the style of your code and not kinda inject its own style too much. And on our internal evals we see, you know, these irrelevant edits went from I think nine percent to two percent from 4o to 4.1.
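The internal eval behind those numbers is not described in the conversation. As a rough illustration only, one hypothetical way to measure "irrelevant edits" is to diff the model's output against the original file and count how many touched lines fall outside the region the user asked to change. The function names and the line-index convention below are assumptions for the sketch.

```python
import difflib
from typing import Set


def changed_original_lines(original: str, edited: str) -> Set[int]:
    """Return 0-based indices of original lines that were replaced or deleted.

    Pure insertions touch no original line, so this simple sketch ignores them.
    """
    a, b = original.splitlines(), edited.splitlines()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    changed: Set[int] = set()
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            changed.update(range(i1, i2))
    return changed


def extraneous_edit_rate(original: str, edited: str, requested_lines: Set[int]) -> float:
    """Fraction of changed original lines that lie outside the requested region."""
    changed = changed_original_lines(original, edited)
    if not changed:
        return 0.0
    return len(changed - requested_lines) / len(changed)


original_src = "a = 1\nb = 2\nc = 3\n"
edited_src = "a = 1\nb = 20\nc = 30\n"   # the request only covered line index 1
rate = extraneous_edit_rate(original_src, edited_src, requested_lines={1})
```

A line-level diff is a coarse proxy; a real eval would likely also need to judge whether an out-of-region change was actually required by the request.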

But obviously two percent is not zero. And so it's something we're gonna continue improving. Yeah. What does that mean for how you then end up using it in your day-to-day coding? I, you know, manage a team now, so there's not that. Alas, the inevitable trajectory of doing well at these companies. But I do use Codex, and I have honestly still been using GitHub Copilot.

It's still a great product, and I also dabble with Windsurf and Cursor. So I'm in and out. But Codex is really cool, the way it does stuff independently. And I think the main model I use there is o4-mini, just for speed. You know, obviously you've kinda alluded to this. There's lots of benchmarks and I feel like people are always debating, like, are benchmarks still relevant? You know, I think you guys even added some, you know

So into, you know, 4.1. I think there's been this feeling in coding for a while, for example, like benchmarks don't tell the full story and you kind of know it when you use it. Like, to what extent is that true? And what's your overall view on the state of these benchmarks today and how useful they are?

Yeah, I do think SWE-bench is still a useful benchmark. Like the actual differences between a model that can achieve fifty-five versus thirty-five are staggering. I think the Aider evals are still super useful. But then there are ones that are just fully saturated and not useful. Basically you gotta

get the most out of an eval during its lifespan and then move on and create another one. And so a three-month shelf life definitely is tough. Yeah. There's gonna be successors to SWE-bench once that's saturated, for sure.

I mean one thing I think that's so interesting about 4.1 is that I think you guys have been very explicit, like this was built for developers and, you know, there's all these evals you did to make it better for the things that developers were asking you for. And it kind of does beg the question, like, how does the

OpenAI model family evolve from here? 'Cause obviously you could imagine a pre-trained model that's post-trained for different end users or, I don't know, domains or tasks. Like I'm sure you guys learned a ton, you know, kind of building this model for this explicit end group. How do you think about that? In general, my philosophy is that

we should really lean into the G in AGI and try to make one model that's general. And so ideally I think going forward we're gonna try to simplify the product offering, try to have one model for both use cases, and, you know, simplify the model picker situation in ChatGPT as well. But for 4.1, we thought there was a particularly acute need, and we thought we could move a lot faster on this problem if we could decouple from ChatGPT. So this let us, you know,

train models, get feedback much quicker, ship on a different timeline. And it also let us make some interesting choices with model training. So we were able to remove some of the datasets specific to ChatGPT. And we were able to upweight the coding data significantly. And so this is stuff you can do when you're targeting a separate domain.

But in general, I do expect us to simplify. And I think the models get better when the creative energies of all researchers at OpenAI are working on them, rather than, you know, the subgroup focused on the API right now.

Well, it also seems like there's been massive cross-domain generalization anyway, where, you know, in general it feels like putting it all into one model has been beneficial. But it's interesting, obviously, that this has been such a success with that more targeted approach. Yeah, there's room for both, I think. Like sometimes it makes sense to [unclear]

and ship a thing for a user you know really well. So do you think that's something you guys might do again? Yeah, I think it's possible. I mean, we make a lot of changes on the fly as we see what demand is there. And it's definitely possible.

Well, one thing I obviously hear from folks all the time is you guys ship models very rapidly. I know the naming has always been debated ad nauseam, about how many different models there are. I feel like companies are trying to stay on top of what the cutting edge of model capabilities is, you know.

Any best practices you've seen from what companies do to just stay on top of it? You know, it feels like a new model drops every month in this space. And, you know, how would you be thinking about it if you were at one of the users of these APIs?

It's all back to evals, unfortunately. Like the most successful startups are the ones who know their use case really well, have really good evals, and then they can just spend, you know, an hour running evals on the new models when they drop.
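As a rough sketch of what "an hour running evals on the new models" can look like in code: a small set of prompts, each paired with a checker, run against every candidate model. The names `run_evals`, `cases`, and the model labels are hypothetical, and the lambda "models" are stand-ins for real API calls so that the example runs offline.

```python
from typing import Callable, Dict, List, Tuple

# (prompt, checker) pairs; each checker decides whether an output passes.
EvalCase = Tuple[str, Callable[[str], bool]]


def run_evals(models: Dict[str, Callable[[str], str]],
              cases: List[EvalCase]) -> Dict[str, float]:
    """Return the pass rate of every model over the same eval cases."""
    results: Dict[str, float] = {}
    for name, generate in models.items():
        passed = sum(1 for prompt, check in cases if check(generate(prompt)))
        results[name] = passed / len(cases)
    return results


cases: List[EvalCase] = [
    ("Reply with exactly the word OK.", lambda out: out.strip() == "OK"),
    ("Return the sum of 2 and 3 as a bare number.", lambda out: out.strip() == "5"),
]

# Stand-in models (assumptions); replace each with a call to a real model.
models: Dict[str, Callable[[str], str]] = {
    "current-model": lambda prompt: "OK" if "OK" in prompt else "The answer is 5",
    "new-model": lambda prompt: "OK" if "OK" in prompt else "5",
}

scores = run_evals(models, cases)
```

Keeping the cases as data and the models as plain callables is one way to make swapping in a newly released model a one-line change.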

There's also, I think, the customers who are really successful are the ones who can switch their prompts and their scaffoldings and tune them to the particular models. So that's what I would recommend. Then the other thing is to build stuff which is maybe just out of reach of the current models. Or maybe it works one out of ten times and you'd love it to be nine. If you have these kinds of use cases in your back pocket and

the new models drop and things just work, then you're, you know, first to market. Do you have a heuristic you use for what's just out of reach? Like obviously I feel like it's hard to tell sometimes how soon some of these things might work. Yeah, I think if you see significant improvements in fine-tuning.

Like let's say you're getting a ten percent pass rate and you can fine-tune it to fifty percent: it's probably not good enough for your product yet.
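The heuristic in this answer (a big jump from fine-tuning that still falls short of product quality marks a task as "on the cusp") can be written down as a small check. The thresholds `product_bar` and `min_gain` are illustrative assumptions, not numbers from the conversation.

```python
def on_the_cusp(base_pass_rate: float,
                tuned_pass_rate: float,
                product_bar: float = 0.9,   # assumed quality bar for shipping
                min_gain: float = 0.2) -> bool:  # assumed "significant" improvement
    """True when fine-tuning helps a lot but the task is still below the bar."""
    gain = tuned_pass_rate - base_pass_rate
    return gain >= min_gain and tuned_pass_rate < product_bar


example = on_the_cusp(0.10, 0.50)  # the ten-to-fifty-percent case from the answer
```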

That's something that's right on the cusp, and a future model a few months from now will probably just crush it. No, that makes a ton of sense. I mean obviously you mentioned being able to switch the prompts and the scaffolding. I think one thing that I think a lot about on the investment side is

You know, there's lots of companies that, you know, the models are able to do what they're able to do. There's all sorts of scaffolding they build, you know, based around those limitations to make the products work today. And then it feels like

you know, you guys release a next great model and some of that scaffolding just gets obviated. It's like, okay, cool. Like the models are way better at following instructions. I don't need to do all this hacky stuff 'cause you have this long context window now.

Given that, how do you think about when it does and doesn't make sense to build some of this scaffolding? Or like what set of scaffolding makes sense for people to focus on? I like to take this back, I guess, to your reason for being as a startup. Like your reason for being is to ship value to your users and make something people want. And so I think it is super worth it to build the scaffolding and make your thing work.

You basically are doing a few months of arbitrage before this capability is available more easily. But I do think it's important to keep in mind future trends. So like maybe build the RAG thing for now, or maybe have your instructions five times in the prompt, although not with 4.1. But just be prepared to change things, and know where things are going. So

You know, I think context windows are only gonna keep improving. I think reasoning capabilities are only gonna get better, instruction following is only gonna get better. And so just have an eye to where those trends are going. Yeah. Any other tips for folks on where things are going? Yeah, I think multimodal is another one.

Multimodal Capabilities

Like the models are getting so natively multimodal and easy to use in those ways. Yeah, I feel like that's been a pretty under-discussed part of 4.1. It's pretty impressive multimodal capabilities. Yeah, honestly, huge shout-out to our pre-training teams, because these new pre-trains have just significantly improved upon multimodal. And I think we will continue to see these improvements.

But so many things that didn't work in 4o just work now because, you know, the models have gotten better there. And so it's worth it to connect the model to as much information about your task as possible. Even if you're getting bad results today.

'Cause tomorrow it'll get better. And you mentioned fine-tuning. I mean, I think it's interesting, right? I feel like we've gone through this journey with fine-tuning where, you know, early on a lot of folks were like, I don't know how helpful this actually is, and then it feels like there's been a renaissance of fine-tuning with these newer models and how helpful it actually is. Like

I guess I'm curious what you've observed. Like does that arc ring true to you? And how should people be thinking about this, and should more people be revisiting their prior assumptions around fine-tuning? Yeah, I think I would bucket fine-tuning into two camps.

The first is fine-tuning for speed and latency. Yeah. And so this is still, I think, the workhorse of our SFT offering. So 4.1 works well, but you can get it at, you know, a fraction of the latency. But then, you know, I think we haven't seen too much of fine-tuning for frontier capabilities. Like you could maybe get them in a really niche domain with SFT, but with RFT you can actually push the frontier in your specific area.

And the fine-tuning process is so data efficient that you can just make do with like a hundred samples or something on that order. Yeah. So our RFT offering is actually shipping to GA next week. I guess your listeners will probably hear about it when it's out.

Um and we're really excited about that. There's some use cases where it works really well. For example, like teaching an agent about how to pick a workflow or uh how to kind of you know, work through its decision process.

Deep Tech Applications and Data Efficiency

Then there's also some interesting applications in deep tech, um where, you know, maybe the startup or or organization has data that other folks don't have and it's really verifiable. And from that you can get the absolute best results with RFT.

I think one thing, you know, that I've been struck by at least is it feels like across the board the number of examples you need is not massive, right? I think in the early days people, you know, were like, oh, well, some of these companies sit on tens of thousands of examples and, you know, they'll just be able to out-compete, and it's like

you know, I mean the data really does matter, but it's maybe to the tune of a lot fewer examples than folks might have previously thought. Yeah, I think these two trends are making fine-tuning more interesting, where it's extremely data efficient, and also RFT is basically the same RL process we use internally for improving our models.

So we just know that it works remarkably well. And it's less fragile than SFT. And so yeah, for those reasons I think it's gonna be really useful for deep tech and some of the hardest problems.

Is this the kind of thing you think everyone should play around with? I mean obviously there's some cases the models can do, but let's take, you know, almost anything where maybe they aren't as accurate as folks want. Is it worth trying this for any of those cases?

Preference Fine Tuning vs. RFT

I think my mental model is: if it's a stylistic thing, then you should probably use preference fine-tuning, which we launched somewhat recently. If it's more simple, like maybe you want nano to classify things and it gets ten percent of cases wrong and you can close that gap with SFT, that's great. But then for the things where just no model in the market does what you need,

then you should turn to RFT. It sounds like you were kind of alluding to the fact that there are some things, especially when they're verifiable, that make this easier to do. Do you have any rough rules of thumb for the types of domains or the types of problems that RFT will be particularly effective for?

Or what these more easily verifiable domains are. Everyone's asking this question now, outside of code and math. Yeah, I think there's stuff in chip design, or in biology, stuff like drug discovery. I think those sorts of things, where maybe you need to explore but the things that work are easily verifiable, will be good applications.
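The rule of thumb in this answer can be written down as a small decision function. This is only a sketch of the speaker's mental model: the `Method` enum, the `choose_method` name, and the boolean inputs are hypothetical, and requiring verifiability before recommending RFT is an assumption drawn from the emphasis on verifiable domains, not a stated rule.

```python
from enum import Enum


class Method(Enum):
    PREFERENCE = "preference fine-tuning"
    SFT = "supervised fine-tuning (SFT)"
    RFT = "reinforcement fine-tuning (RFT)"
    NONE = "no fine-tuning recommended by this rule of thumb"


def choose_method(stylistic: bool, small_gap: bool,
                  beyond_frontier: bool, verifiable: bool) -> Method:
    """Map the conversation's rules of thumb onto a single suggestion.

    stylistic:       the problem is about tone or style
    small_gap:       a simple task with a modest error rate (e.g. a small
                     model misclassifying roughly ten percent of cases)
    beyond_frontier: no available model does what you need
    verifiable:      outputs can be checked cheaply (assumed RFT precondition)
    """
    if stylistic:
        return Method.PREFERENCE
    if small_gap:
        return Method.SFT
    if beyond_frontier and verifiable:
        return Method.RFT
    return Method.NONE


if __name__ == "__main__":
    # A hard, checkable problem that no off-the-shelf model solves.
    print(choose_method(False, False, True, True).value)
```

The order of the checks mirrors the order in which the options are described; a real project would likely weigh them against cost and available data rather than treat them as strict branches.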

Certainly chip design is that. I feel like drug discovery is a perpetual awesome use case, but sometimes it takes ten years to figure out if it actually works in people. So the feedback loop is always... obviously there are interim steps in between. But eventually, I mean, it does kind of beg the question. You see in 4.1, obviously, these multimodal capabilities. You talk about the ability to use RFT for biology.

I guess there's always been this question of whether there are going to be standalone types of foundation models, like a robotics foundation model or a biology foundation model, that has nothing, or has something, to do with these models but is kind of a separate class of models. What's your view on that? You mentioned the G in AGI before. Does it feel like we're converging in that aspect?

I kinda do. I think generalization improves capabilities a lot. I think it remains to be seen with robotics; I guess we'll know empirically if the best robotics products are their own models. But I do kind of think, and I think the trends I see here internally are, that

combining everything just produces a much better result. Everyone's teased that you'll soon have one model that will pick, behind the scenes, what to use for people. But obviously we don't have that yet today, and so

Choosing the Right Model for Your Needs

I'm curious: if I'm a company figuring out which to use, obviously I'll probably test a bunch of them. Do you have any rough rules of thumb on which models people should be choosing for the different things they're trying to do? Yeah, totally. It's a pretty tough decision tree, so I'm excited we're going to simplify it. But here's how I think about it.

In ChatGPT, I'm obviously a ChatGPT DAU. And so my... You and me both. Yeah. My main model there is 4o, and I use 4.5 sometimes for writing or creative stuff. And then o3 is what I use for the hardest math problems or, I don't know, I was filing my taxes and I wanted them done right. That's somewhere where I'd use o3.

Does that line up with you? Are those the models you use in chat? I wasn't sure the models were good enough yet to trust my taxes to them, so I haven't done that yet. But maybe I should have. If you're saying it's good enough, that's great. Next year I will totally go ahead and do that.

I'm more into double-checking my CPA. But yes, that definitely lines up on the consumer side. And then I'm curious, for the enterprise users, obviously I feel like you always want to go as fast and cheap as you can, but I think folks are still trying to figure out exactly when to reach for each different kind of model. Yeah, totally. So how I think about it there is

Developers should just start with 4.1 and see if it works well for their use case. If it does and you're looking for faster, then I would look into mini and nano and fine-tuning those. Obviously mini next and then nano, as the smallest model. And then if some things are just out of reach for 4.1, then I would push for o4-mini and see if you can get sufficient reasoning capabilities out of it. And then you go to o3.
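The escalation order in this answer can be sketched as a short routine driven by your own evals. Everything named here is an assumption for illustration: `pick_model`, the `passes_evals` callback, and the model labels are informal placeholders, not verified API identifiers.

```python
from typing import Callable, Optional

# Informal labels in the order described in the conversation.
ESCALATION = ["4.1", "o4-mini", "o3"]
SMALLER_THAN_4_1 = ["4.1-mini", "4.1-nano"]  # mini first, then nano


def pick_model(passes_evals: Callable[[str], bool],
               want_faster: bool = False) -> Optional[str]:
    """Return the first model in the escalation order that passes your evals.

    If 4.1 passes and speed matters, step down to mini and then nano for as
    long as the smaller model still passes. Returns None when nothing passes;
    the speaker's next step in that case is RFT with o4-mini.
    """
    for model in ESCALATION:
        if not passes_evals(model):
            continue
        chosen = model
        if model == "4.1" and want_faster:
            for smaller in SMALLER_THAN_4_1:
                if not passes_evals(smaller):
                    break
                chosen = smaller
        return chosen
    return None


if __name__ == "__main__":
    # Hypothetical eval outcome: 4.1 and mini pass, nano does not.
    good_enough = {"4.1", "4.1-mini"}
    print(pick_model(lambda m: m in good_enough, want_faster=True))
```

Passing the eval check in as a callback keeps the routine independent of how quality is measured, which matters because the right threshold differs per use case.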

And then if that's not working, then you go to RFT with o4-mini. I guess on the other side of using these models, one thing I always enjoy is the prompting guides you guys release alongside these models, because it's always kinda like

Prompting Techniques and Model Improvements

funny, sometimes counterintuitive, the different things that help on the prompt side. Any particularly favorite things that have emerged, like, oh, that's actually a really helpful way to prompt 4.1? Yeah, I think we found XML, or structuring your prompts really well, works super well. The other thing is just telling the model to keep going. I liked that one.

It's something we're hoping to fix for the next one, but it is remarkable like how much better performance you can get by telling the model like, Hey, please don't come back to me until you've solved the problem.
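The two prompting tips mentioned here, XML-style structure and an explicit instruction to keep going, can be combined in a small prompt builder. This is a sketch only: the tag names, the `build_prompt` helper, and the wording of the reminder are illustrative assumptions, not text taken from OpenAI's prompting guide.

```python
from xml.sax.saxutils import escape

# Paraphrase of the "keep going" idea; the exact wording is an assumption.
PERSISTENCE_REMINDER = (
    "Keep going until the problem is fully solved. "
    "Do not hand control back until you are confident the task is complete."
)


def build_prompt(instructions: str, documents: list[str], task: str,
                 persist: bool = True) -> str:
    """Assemble a prompt with clearly delimited, XML-style sections."""
    parts = ["<instructions>", escape(instructions)]
    if persist:
        parts.append(PERSISTENCE_REMINDER)
    parts.append("</instructions>")
    for i, doc in enumerate(documents, start=1):
        parts.append(f'<document id="{i}">')
        parts.append(escape(doc))  # escape so content cannot break the tags
        parts.append("</document>")
    parts += ["<task>", escape(task), "</task>"]
    return "\n".join(parts)


if __name__ == "__main__":
    print(build_prompt(
        "You are a careful coding assistant.",
        ["def add(a, b): return a - b  # bug: a < b is irrelevant here"],
        "Fix the bug and explain the change.",
    ))
```

Escaping the inserted text keeps stray angle brackets in documents from being mistaken for section boundaries, which is the main practical reason to build such prompts programmatically.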

So yeah, those are interesting and somewhat counterintuitive. How do you go about, like, you've seen that keep-going thing, and obviously your cookbook shows a big impact. How do you then go about incorporating that into the next generation of models such that that isn't a thing anymore?

Our post-training process can be pretty sensitive to the exact mix of data used. So, you know, you can imagine a post-training process where you train the model on one diff format, and then your users are using totally different diff formats, and the model is a bit lost.

Whereas for 4.1 we trained the model on twelve or so different diff formats, everything we could think of. And so our goal is to really put out something that works really well, and even document maybe the best one. So our prompting guide has

diff formats we found that work well. We also want it to work well out of the box for developers who aren't going to read our docs, which I recognize is most. You want it to work anyway, even if you're not using the best. So we focus a lot on general prompting and general capabilities. And this way we don't kind of burn a specific one into the model.

Yeah. The keep going is a great thing to say to our team internally too. So it definitely helps across the board. You've obviously mentioned evals as one thing that the most sophisticated companies do well.

I'm curious if there's something, either maybe some of the OpenAI products or some techniques, that a select few companies are using really well and you're like, God, I just wish that thousands of companies were using this or thinking about things this way.

Yeah, I think some of my favorite developers to work with are those who know their problem really well and actually have evals for the whole problem, but can break them down into specific subcomponents. And so they can tell me things like the model got better at picking the right SQL table by this percentage, but it got worse at picking the right columns by this percentage. And it's like, wow, this level of granularity really helps you tease out

what actually is working and what isn't. And then they can tune specific parts of this. So I guess making your system modular and easy to plug different solutions into, I think that takes a little time up front, but makes you move faster in the long run.
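The kind of component-level eval described here, where table selection and column selection are scored separately, might look like the following. It is a minimal sketch under stated assumptions: the `Example` record, the exact-match scoring, and the function names are hypothetical and not any particular team's eval harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    expected_table: str
    expected_columns: frozenset
    predicted_table: str
    predicted_columns: frozenset


def component_accuracy(examples: list[Example]) -> dict[str, float]:
    """Score table choice, column choice, and the end-to-end result separately."""
    if not examples:
        raise ValueError("need at least one example")
    n = len(examples)
    table = sum(e.predicted_table == e.expected_table for e in examples)
    columns = sum(e.predicted_columns == e.expected_columns for e in examples)
    both = sum(
        e.predicted_table == e.expected_table
        and e.predicted_columns == e.expected_columns
        for e in examples
    )
    return {"table": table / n, "columns": columns / n, "end_to_end": both / n}


def compare(old: list[Example], new: list[Example]) -> dict[str, float]:
    """Per-component change in accuracy when moving from one model to another."""
    before, after = component_accuracy(old), component_accuracy(new)
    return {name: after[name] - before[name] for name in before}


if __name__ == "__main__":
    cols = frozenset({"id"})
    old_run = [Example("orders", cols, "orders", cols),
               Example("users", cols, "orders", cols)]
    new_run = [Example("orders", cols, "orders", cols),
               Example("users", cols, "users", frozenset({"email"}))]
    print(compare(old_run, new_run))
```

In this toy data the end-to-end number does not move at all while the two components shift in opposite directions, which is exactly the situation a single aggregate score would hide.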

I guess a question people are always asking is how much AI expertise the leading AI app companies will need, versus just being good engineers who take your models off the shelf and know their end customer. Do you think long term, being able to have a sense of what data to apply on the fine-tune, or to tweak your evals,

does that end up being a really important skill set for the app category? Or is it really like, no, they can kind of take the models mostly off the shelf, or do a basic fine-tune, and the core AI research capability may be less important? Yeah, I'm really long generalists. So I think people who understand the product and are really scrappy, engineers who can do anything, like

I honestly don't think you'll need that much expertise to combine these models and these solutions in the future. So yeah, I'm definitely much more bullish when I hear about a team of scrappy hackers than a bunch of PhDs with only research publications under their belt. And there are so many exciting areas to continue pushing these models forward. What future areas of research are you most excited about to make these models better?

Future Research and Model Enhancements

I'm really excited about using our models to make models better. And so this is particularly useful in reinforcement learning, when we can use signals from the models to figure out if the model is on the right track. I'm also excited, and this is a more general research area, but we're working on improving

our speed of iteration. The more experiments you can do, the more research gets done. And so it's a real focus right now to make sure we can run our experiments with the fewest

number of GPUs. You basically want to kick off a job and know, when you wake up in the morning, if this thing is working or not. Is that just a pure infrastructure problem, the latter part? Not really. You also need to make sure that

the things you're training are at sufficient scale to get signal on what exactly it is you're experimenting with. So there are also some interesting ML problems there. Yeah. And then in terms of using the models to make models better, and signals of whether you're on the right track, where are we in that? Does that work, or are we still kind of in the early innings of that? Yeah, it works remarkably well. I think

synthetic data has just been an incredibly powerful trend. So yeah, I'm excited to push this more, but every more powerful model makes it easier to improve our models in the future. You guys have also shipped some really interesting agents. I think Deep Research, probably most famously, is a product that I use all the time. And basically,

as I understand it, it's using reinforcement learning on a tool or set of tools, right? Until the model gets really good at using it.

How do you imagine that type of approach scaling for agents at large? I guess it's kind of a sub-variant of the question we were talking about earlier, of building these specific models for end users, or specifically doing RL on tools, versus the G of generalization here.

Yeah. Deep Research is like zero to one, or Deep Research and Operator are like zero to one or two, where you want to train the model really deeply on this specific thing. But I think what we've seen with o3 is that we can just train the model to be great at all kinds of tools. And actually, learning to use one set of tools makes it better at other sets of tools. So I don't expect too much

of just one-tool-specific training going forward. It's like we've kind of proven that out and now we can incorporate those capabilities broadly. And actually, that's one thing people really love about o3: it can do a lot of what Deep Research does, a lot of those capabilities, but quicker. And you can really go for Deep Research when you want the absolute best report.

But if you want something somewhere in between, then o3 is a great fit for that. Yeah. And as the general models get better at using tools and doing some of these tasks,

are there areas that you think will be easier or harder? I mean, obviously you guys have publicly said you'll have a coding agent. I don't know, as folks are thinking about what capabilities are coming sooner rather than later, is there any

mental model you use of, like, yeah, I think these things would come before the next set of things? Yeah, I mean, coding is obviously coming soon, given SWE-bench numbers are already exceeding what a lot of humans would get there. So I think the ability to supervise these long runs is there. In terms of other stuff, I think long workflows.

So what's interesting about o3 already is that when it calls developer-specified tools, they're already part of the chain of thought of the model. So the model can use the thoughts of the previous tool call and the output and think some more about what to do. And so because of that, I think the agentic capabilities, like maybe customer support or other sorts of things, are, personally I think, there, and they just need to be hooked up with everything to make

a cohesive product. Yeah. I mean, it seems like in many ways the capabilities of these models exceed the actual nitty-gritty implementation of, yeah, hooking them up to things and getting enterprises ready to use them in some way. But I think there's always this big debate: if you completely stopped model progress right now, are there just

tens of trillions of dollars of value to be extracted from these models? And it seems like you're very much in the camp of yes. Yeah. I mean, if you think about the capabilities overhang of the internet, we still haven't saturated it. Things are still coming online; the internet is still eating the world.

And I think for AI, we haven't even saturated the capabilities of 3.5 Turbo. I still think there are billion-dollar companies to be built that only need that level of capabilities. And so now with 4.1 and these reasoning models, I think we have, you know,

if we truly stopped right now, I think we'd have ten years of building at least. Sam's obviously talked about combining the model families into this GPT-5 that will probably end the really fun point-this and point-that and all that. But like

What actually needs to be done to combine this into a single model? It goes back to what the models are good for. So right now the 4 series is really great for chat, and most users in the world use 4o. They love the way it matches tone and style preferences, and it's a great conversationalist. It's good at, like,

figuring out deep conversations with people, or it's kind of a good sounding board. But o3 has a very different skill set. It can think through problems really hard. You don't really want the model to think for five minutes when you say hi. And so I think the real challenge facing us in post-training, and research more broadly, is combining these capabilities.

So, training the model to be just a really delightful chitchat partner, but also know when to reason. And this kind of plays into 4.1 a bit. I mentioned that we downweighted some of the chat data and upweighted to make coding better. So there are some zero-sum decisions in that sense, where you have to figure out what exactly you're tailoring the model for.

So that's the real challenge in GPT-5: how do we strike this right balance? Yeah. I mean, it's so interesting, because I feel like one reason people have been drawn to different models in the past has been intensely about personality, basically. I like the personality or vibes of this model. And I'm struck by,

I mean, in some sense, trying to combine it into one model, you get a median personality. And back to the earlier question, I wonder whether longer term folks will want, you know, and maybe they accomplish this through prompting or through the model learning about you, and then the models themselves have all these personalities within them that can kind of emerge. Any thoughts on that?

Power Users and Personalization

Yeah, I already think we're going this way a bit with enhanced memory. So I think my ChatGPT is so different from my mom's or my husband's. So I think we're going in this direction already. It's just becoming so much more useful the more it knows about you. But also, the more it knows about you, the more it can adapt to the things you like. So I think that's actually going to be a really powerful lever for personality in the future.

But we're also going to make it more steerable. You can already use custom instructions and tell the model, hey, I don't like capital letters, or please never ask follow-up questions, I don't like that. So I think we're going to lean more into steerability there. I think everyone should be able to tweak the personality that they want. But yeah, I'm curious,

what kind of personality are you looking for? I'm still discovering, right? But I like that the banter is fun, right? And a little like hanging out with your fun and quirky, kind of

almost-takes-risks-sometimes-in-the-stuff-they're-saying type of friend. I feel like I always enjoy that. I guess I'm also curious just to hit on your personal journey at OpenAI. Obviously you've done a ton of different roles within OpenAI. You've also like

Personal Journey and Organizational Growth

You know, the company has had, I mean, probably a million different sub-chapters of growth and experience in your time there. Maybe just talk a little bit about your personal journey there, and also what feels similar and different from the early days when you joined to now leading this large team here? Yeah. So I've been here for two and a half years.

And I joined on the API team on the engineering side. Actually, a lot more of my background is engineering. I've worked at other companies like Coinbase, building their high-frequency, low-latency trading systems. So a lot more of a focus on backend distributed systems. But I did study AI in college, and I worked with some professors there on research projects, and I actually remember using OpenAI Gym at the time, which was super cool.

But yeah, I was here for about a year and a half working on engineering, and then it kind of seemed like it made sense to focus more on the model side for the API specifically. There wasn't really enough of a focus on improving the models for developers, and I kept hearing that folks wanted something like structured outputs. And so that was kind of the first foray into doing research here, training the models to do that and building the engineering systems.

And then after that, I kind of formed this team and then moved over to research. And I actually recently rebranded my team a bit, and we focus now on power users. So it's the Power Users Research team.

The reason for this rebrand is that we don't just focus on the API. Obviously developers are some of our most discerning power users. They use features that other users don't know about. They know about prompting our models the best, they know the capabilities the best.

But there are also power users across all of ChatGPT. There are some in Free, there are some in Plus and Pro. And I'm kind of insulted I haven't been reached out to as a ChatGPT power user. I thought I might have hit the threshold, but I guess there are probably some people who use it a lot more. I mean, yeah, we get a lot of signals from people who are using our models in this way. But also, the reason it's interesting to focus on power users

is because the things that the power users are doing today are going to be the things that the median users are doing a year from now. So we just learn a lot from being on the frontier and figuring out what we can do to make the models better for them.

And I guess, what's it been like? Obviously, over those two years, I feel like the organization has changed a lot, both in size and in the scope of things you work on. What still feels the same, and what's really different these days? Yeah, I think the pace of shipping is the same. It's actually remarkable how an organization this large can move so quickly.

I think one thing that's different is you just definitely can't have context on everything going on at the company anymore. It used to be more possible to have pretty good state on all of the cool projects going on, and read all of their research updates, and be intimately familiar, but now you kind of just have to

tolerate that you can't know everything cool going on anymore. Totally. Well, we always like to end our interviews with a quickfire round where we get your take on some overly broad closing questions. And so maybe to start,

Quickfire

I would love your take on just one thing that's overhyped and one thing that's underhyped in the general AI discourse today. So yeah, overhyped, I think benchmarks. Like I mentioned, a lot of the agentic ones are saturated, or people release the absolute best

numbers they get, but realistic numbers are different. And then underhyped, I mean, the corollary of that is your own evals. So using your real usage data to figure out what's working well is underhyped. Awesome. What's one thing you've changed your mind on in the AI world in the last year? Yeah, this is back to fine-tuning, but I actually used to be more of a fine-tuning bear, because it's kind of like,

you know, it's a few months of arbitrage, but is it really worth the time? But I actually do think RFT is worth the time for these specific domains where you need to push the frontier. Yeah. Was there one particular fine-tune that convinced you, or was it over time, having seen this? I think the cool thing now is that

our previous post-training stack, or the 4.1 stack, is a lot more than just SFT. And so we weren't shipping how we trained our models. But with RFT, it's basically a similar algorithm to our reinforcement learning. And so that's why I think it's a big shift, where you can actually get the capabilities that we can elicit ourselves. Do you think model progress will be more, the same, or less than last year this year?

I think, I'm not sure, it'll be about the same. I don't think we're slowing down. I don't think we're in a fast takeoff at the moment. But it's going to continue to be fast, and there will be a lot of models. I realize I can't ask you to pick a favorite, but I'm curious, you mentioned this class of harder-to-solve problems,

you know, maybe beyond enterprise apps, are there any consumer products or things that you're most excited about outside of OpenAI, or things that you use in your day-to-day life? Yeah, I use a lot of stuff that is AI based. Recently I've been using Levels, and they have a pretty cool AI focus there. I think Whoop has some very cool health insights as well. Yeah. Yeah, I think taking AI out of just

the digital world is super cool. Well, this has been a fascinating conversation. I want to make sure to leave the last word to you. Where can folks go to learn more about you, 4.1, anything you want to point our listeners to? The floor is yours. Yeah, totally. Thanks. So yeah, we put out a blog post for 4.1 if you want to read more about it. I'm also on Twitter, and I love hearing feedback from users, like developers and power users.

So if something isn't working well in our models and you have a prompt that can show it, please email me. I'm first name at openai dot com, and I just love

getting the feedback so we can make models better. We'll have to get you on again to talk about the weirdest email you get from this, like an obscure use-case prompt. Yeah, I've already gotten some good ones. Yeah. Well, Michelle, thank you so much. This was a ton of fun. Yeah, thank you so much for having

This transcript was generated by Metacast using AI and may contain inaccuracies.