Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel, and I'm joined by my co-host swyx, founder of Smol.ai. Hello, hello, calling in from Singapore here, but we are in the remote studio because the OpenAI team keeps shipping, and today they just live-streamed and released ChatGPT Codex. Welcome to Josh,
who I think we've talked about, we've met while you were at Airplane, right? Yeah, I've been building DevTools for a bit now, and you're the guy that I have to talk to when I'm building DevTools. I mean, you have now seen me complain a lot when things happen. So I don't know if it's a good or bad thing. It's a gift, man. Feedback is a gift. Thank you. Alexander, we're new to each other, but you've been leading a lot of the codex testing and demos.
Yeah, hey, I'm Alexander. I'm on the product team here. Awesome. So, yeah, we're going to just assume that everyone's watched the live stream. We also released a blog post with a bunch of demo videos. It's very interesting. I noticed in the demo videos, it was individual engineers sitting by themselves, very lonely, and then they're just talking to their AI friends coding with them. I don't know if that's the vibe you want to give off, but that's how it came across.
Yeah, man, those videos, we were going for like maximum authentic, just like how it helps them. Yeah, I'll take the feedback. But no, I mean, it's true. I mean, sometimes, you know, on-call is a lonely job, like mobile engineer is a lonely job. Like there's not that many of those. So yeah, totally. But anyway, so what did you guys individually do? Maybe we can kind of start there. How did you get pulled into the project? And we'll start from there.
Yeah, maybe I can go first because then we have a fun story about how we started working together.
Okay, so actually before working at OpenAI, I was working on a native macOS app called Multi, which is kind of like a pair programming tool, but we thought of ourselves as working on human-to-human collaboration. And then basically, as ChatGPT and stuff came around, we started thinking about, oh, what if instead of a human pair programming with a human, it was a human pair programming with an AI?
So I'll skip this whole journey, but that was this whole journey, and then we all ended up joining OpenAI. And I was mostly working on desktop software, and then we shipped reasoning models. And, you know, I'm sure you guys were ahead of the curve in terms of understanding the value of reasoning models, but for me, it kind of starts off as better chat, but then when you can give it tools, you can actually make it an agent, right? Like an agent is a reasoning model with tools and environment guardrails, and then maybe training on specific tasks. So anyways, we got super interested in that, and we were just starting to think about how do we bring reasoning models into desktop. And at the same time here at OpenAI, there were a lot of experiments going on with giving these reasoning models access to terminals.
I wasn't working on those first experiments, to be clear, but that was like the first true, like, wow, I really feel the AGI moment that I had. It was actually while I was talking to David Kay, a designer who was working on this thing called Science.
And, um, he showed me this demo of it updating itself. And, like, nowadays, I don't know if any one of us would be that impressed by it changing the background color. Modifying its own code? Yeah. And then, you know, it had a hot reloading setup.
So I was just mind-blown at the time, and it was still a super cool demo. And so we kind of were experimenting with a bunch of these, and I sort of joined one of the teams that was tinkering with this. And, you know, we kind of realized, hey, it's just super valuable to figure out how to give a reasoning model access to a terminal, and then now we have to figure out how to make that a useful product and how to make it safe, right? Like, you can't just let it go loose on your local file system, but that's where people were initially trying to use it. So a lot of those learnings ended up becoming the Codex CLI, which shipped recently.
A lot of the work there, the thinking that I'm most proud of, is enabling things like full-auto mode. When you do that, we actually increase the amount of sandboxing so that it's still safe for you. And then, so, you know, we were working on these types of things, and then we started realizing we want to let the model think for longer. We want to have a bigger model. We want to let the model do more things safely without having to do any approval.
And so we thought, you know, maybe we should give the model its own computer. At the same time, we were also experimenting with putting the CLI in our CI so it could automatically fix tests. We did this crazy hack to get it to automatically fix Linear tickets in our issue tracker. And so then we ended up creating this project that is Codex, which is basically really the concept of giving the agent access to a computer.
Actually, I realize I don't know if you're asking what I personally did, but anyways, I told the story. I hope that's okay. Sure. No, I mean, you weave your personal story into the larger narrative anyway, but yeah, and I'm sure Josh has a part two. Yeah, yeah, so mine is somewhat different. I've been at OpenAI for two months here, and it's been one of the most fun, chaotic two months of my life. But maybe I'll start back at the company I had founded a few years back called Airplane.
We were building an internal tool platform. The idea was to let you build internal tools, but really lean into developers and make that experience great. And it sounds unrelated, but in many ways, the similar themes started coming up. What's the right form factor for doing local development? How do you deploy tooling to the cloud? How do you run code? How do you compose all these primitives of storage and compute and UI to let developers build software?
I like to joke that we were just, I don't know, two years too early. Towards the end, we were playing around with, like, GPT-3.5 and, you know, trying to really make it work. It was really cool to actually build, like, a React view really quickly, right? And I think, you know, if we had kept going on it, maybe it would have turned into some of the AI builders that you see today. But that company ended up getting acquired by Airtable, where I ran some of
the AI engineering teams there. And for me personally, towards the beginning of this year, I saw the progress being made in software agents. And for me, it was a bit like my own moon landing kind of moment that I suspected was about to happen, right? Whether or not I was involved, in the next two years I think we are going to build an agentic software engineer.
And so I talked to my friend and I was like, hey, are you guys working on something like this? And, you know, he gives me a wide-eyed look. He's like, I'm not allowed to tell you anything, but maybe you should come talk. And so, very fortunately, this was right when Alex and folks were spinning things up. And I remember actually, in our interview, we riffed on the form factor. Should it be a CLI? The issues with that, waiting for it to finish and not being able to interrupt all the time.
wanting to run it four times, ten times in parallel. And, you know, at that point I said maybe it should be both. And we sort of are, you know, going for that right now. But yeah, I don't know, I'll just say I was very excited then, and having pushed this far with Codex, I'm still really excited to share it with the world, but there's a lot more to build. I'll say it was a very fun conversation when we first met, because he came in.
I've never had this happen before. It's like, here's exactly the change that I see in the world and therefore the type of product that I want to build. I know you can't confirm if you're working on it, but just so you know, this is the only thing I want to work on.
And then I asked just a few open-ended questions, and we immediately got into some of the core debates around the form factor of the tool. And I was like, okay, this is awesome. I think a DevTools person can spot another DevTools person like that. Yeah, blink twice if you're working on this.
But for what it's worth, the early iPhone team at Apple was the same, because iPhone team members did not know if they were on the same team. They're not allowed to tell each other. So they had to, like, triangulate. Talking about form factors, you mentioned the CLI, which you already released, and I think there's also Claude Code, Aider, a bunch of other tools out there. Should people think of Codex as a hosted Codex CLI? Are there big differences between the two? Let's talk about that.
Yeah, go for it. Yeah, I think that's the short of it, right? Allowing you to run Codex agents. But I think that the form factor, like, it's a lot more than just where it runs. How does this bind to the UI? How does this scale out over time? How do you manage caching and permissioning, and how do you do the collaboration story? And so let me know if you disagree, but I think that it really is its own form factor.
Yeah, it's been honestly a really fun journey. The other day, or maybe last night, in the AM, Josh was sleeping because he had to do the live stream. I didn't have to. But anyway, a bunch of us were looking back at the doc where we planned what we were going to ship. We were like, man, we had a lot of scope creep.
Effectively, all that scope creep incrementally made sense, because we kept leaning further and further into this idea that this is not just a model that's good at coding, but rather this is an agent that is good at independent software engineering. And the more we leaned into that, the more things started to feel really special. I'm going to just label and then set aside the entire conversation around the compute platform that Josh has been leading. Let's just take the model, for example.
We don't just want it to be good at code, and we don't just want it to solve, say, SWE-bench tasks. SWE-bench is an eval, for those who don't know, that has a certain way of functionally grading outputs. Because if you look at a lot of SWE-bench-passing outputs from an agent, they're not really PRs that you would merge, because the code style might be different. Like, it works, but the code style is off.
So, you know, we spend a lot of time making sure that our model is great at adhering to instructions, great at inferring code styles so that you don't have to tell it. But let's say that you then got a PR where the code style was good and it followed your instructions well. It still might be really hard to merge if you have this enormous description that's just a dump of how the model thought about building it.
And you probably need to pull it onto your computer to test the change and validate that it works. And maybe that's okay if you're just running one change, but in a future world that we imagine, where maybe the majority of code is actually being written by agents that we're delegating to, doing tasks in parallel, it becomes critically important that you can actually integrate those changes easily as the human developer.
So for instance, some of the other stuff we started to train was PR descriptions. Like, let's really nail this idea of a good, concise PR description that highlights the relevant things. So our model will actually write a nice short PR description with a PR title that adheres to your repo format. We have a way to prompt that more if you want with agents.md. And then in the PR description, it'll actually cite relevant code that it found along the way or relevant code in its PR.
So you can like mouse over and just see it. You know, and perhaps my favorite thing is actually the way we handle testing. So the model will attempt to test its change.
And then it will tell you, in this really nice way with just a checkbox kind of thing, whether or not those tests passed. And again, it will cite, if the test passed, a deterministic reference to the log, so you can read it and be like, okay, I know that this test passed, right? Or if the test failed, it'll be like, hey, this didn't work, I feel like you need to install pnpm or whatever, and you can read the log and see what it is.
So those are some of the things that I think I've lost track of the original question, but anyways, those are some of the things that we're really leaning into as we build this software engineer agent in the cloud. I think also it feels very different. You can look at the features, but I think for me, the feeling...
It takes a leap of faith the first few times. You're like, I'm not really sure if this is going to work. And it goes off for like 30 minutes, right? But then it comes back and it's like, wow, this agent went out, wrote a bunch of code, wrote scripts to help codemod its own changes, tested this, and it really went through the full end-to-end of thinking about the change it wants to make. And I, you know,
had no faith at the start that it was going to be able to successfully do it. And after using it a bit, you're like, wow, it actually pulled through. And so that kind of long-running independence, it's something that's hard to really summarize. You have to really try it. But it genuinely feels very different. Yeah, I used it. I opened a PR with it a few minutes ago. I wasn't in the lucky first 25% of people to get the rollout.
Yeah, it's very nice. I kind of shortcut it because I couldn't figure out how to run RSpec in Rails, and so I just checked the syntax of the Ruby file, and it was like, looks good to me. But I think it doesn't have the agents.md yet, so I think once I set that up, that'll be good. No, just don't use Ruby, man. It's got issues. Once it's good enough to migrate the whole thing, then I'll do that.
I mean, it is funny, just briefly on the note of, like, don't use Ruby or not, there's a bunch of things that I think teams can do to make better use of AI agents. Stop using Ruby, number two. But yeah, if you could list some things out as best practices. You know, I know from the live stream
that they mentioned pro users install linters and formatters so that basically these are in-the-loop verifiers that the agent can kind of use, which turns out to be dev best practices as well, but now the agents can auto-use it. Commit hooks have always been a tricky thing for humans.
Because I've been on teams that are like, no, everything has to have a commit hook. And then I've also been on teams that were like, no, this thing gets in the way of committing, so let's rip everything out. But actually for agents, it's really good to have commit hooks. Yeah, I mean, you took the words out of my mouth. I think the three I was going to say would be, one, agents.md. And we put a lot of effort into making sure it can
understand this hierarchy of instructions, right? You can put them in subdirectories and it'll understand which ones apply where. So it builds up over time, right? Like, I mean, we also have o3 and o4 writing our agents.md files for us. I love the tips. You actually open-sourced the prompt descriptions here. Yeah. Yeah, yeah. And it's a highlight. I mean, I think I would start simple and not try to overdo it.
A simple agents.md will get you a long way, rather than no agents.md at all. And then it's more that you learn over time, right? What we would really like to do is auto-generate this at some point for you based on the PRs you create and the feedback you give; we figured we'd ship sooner rather than later. You mentioned you have o3 and o4 writing agents.md for you as well.
Yeah, I'll give it my entire directory, right, and just say, hey, produce an agents.md. Well, actually, these days I'm using codex-1 to do it, because it can traverse your directory tree for you. So yeah, I would recommend slowly, gradually investing in them. And, you know, you took the words out of my mouth: getting very basic linting and formatting set up, that's where your really big wins are. It's similar to how, if you open a new project in VS Code, you get some out-of-the-box checking. The agent is starting, like a human would be, without that advantage, and so this is trying to get that back.
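For readers who want a concrete starting point, here is a minimal sketch of what a simple agents.md could look like. It's illustrative only: the directory names, commands, and conventions below are assumptions about a hypothetical TypeScript repo, not a template from the Codex team.

```markdown
# agents.md (hypothetical example)

## Layout
- web/: Next.js frontend (TypeScript)
- server/: API service (TypeScript)

## Conventions
- Use TypeScript everywhere; match the existing ESLint/Prettier config.
- Run `npm run lint` and `npm test` before finishing a task, and fix any failures.

## Pull requests
- Title format: `area: short imperative summary`
- Keep descriptions short; call out any follow-up work explicitly.
```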
Yeah, I don't know. Do you have anything else? Yeah, so one analogy there, and then actually I have some thoughts we've observed from using other coding agents, just any coding agent, you know, how to prepare for that.
You know, the analogy I kind of like is: if you start with a base reasoning model, you basically have this really precocious, incredibly intelligent, incredibly knowledgeable, and weirdly spikily intelligent, you know, college grad. But we all know if you hire that person and ask them to do software engineering work independently, there's just a lot of practices that they're not going to know about.
And so a lot of what we've done with codex-1 is basically give it its first few years of job experience. And that's effectively what the training is, right? So that it just kind of knows more of these things. And if you think about it, a PR description is a classic example of that: writing a good PR description, right? And possibly knowing what not to put in it, actually, right?
So that's where you get to. So now you have this weirdly knowledgeable, spikily intelligent college grad with a few years of job experience. And then every time you kick off a task, it's kind of like their first day at your company. Right, and so agents.md is basically a method for you to compress that test-time exploration that it has to do, so it can know more.
And as Josh said, right now it's a research preview, so you have to update it yourself, but there's a lot of ideas we have for how to make that automatic. So that's just a fun analogy. Yeah, maybe the last one I'll say is like make your code base discoverable. It's like the equivalent of
maintaining good engineering practices for new hires that you make and letting them understand your code base faster, right? A lot of my prompts start with something like, I'm working in this subdirectory, here's what I'd like to accomplish, can you please do it for me? And so giving that guidance.
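As a concrete illustration of that kind of scoped prompt (the path and task here are made up, not from the team):

```text
I'm working in server/billing/. Invoice totals are off by one cent when a
discount is applied. Please find the rounding bug, fix it, and add a
regression test next to the existing billing tests.
```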
Yeah, okay, I'll give you three, sorry, three things generally. So first, language choice. I was hanging out with a friend the other day, a bit of a late converter to AI, and he was like, oh yeah, I want to try building an agents product. Like, should I build it in JavaScript? And I was like, you're still using JavaScript? No wonder. Like, you know, use at least TypeScript. Give it some types.
So, I mean, I think that's a basic one. I don't think anyone listening to us now needs to be told this. Another one is, like, just make your code modular, right? The more modular and testable it is, the better. But you don't even have to write the test. Like, an agent can write the test, but you kind of need to design the architecture. I saw this presentation recently by someone here who was like,
They weren't vibe coding, it was a professional software engineer, but using tools like Codex to build a new system. And they got to build the system from scratch, and there was kind of this graph of their commit velocity. And then their system got some traction. So then it was like, okay, now we're going to port it into, you know, the monolith that is the overall ChatGPT code base.
That, you know, has seen ridiculous hypergrowth and so maybe is not the most architecturally pre-planned. And their commit rate, the same engineer, same tooling, and the AI tooling continues to improve, their commit rate just plummeted, right? And so I think the other thing is just, yeah, architecture. Good architecture is even more important than ever, and I guess the fun thing is that, for now, that's something that humans are really good at, so, you know, it's important for the software engineers to do their job. I don't know. Just don't look at my code base. Yeah, well, definitely. But the last thing is just kind of a fun story, which is the codename, the internal codename for our project, is wham.
Like, wham. And it was chosen, actually, by a research lead, and he was like, hey, make sure you grep the code base before you choose the code name. So we searched the code base, and the string wham was only present inside a few larger strings and never present as its own string.
And that means that whenever we prompt, we can be very efficient. We can just say "in wham", right? And then wham code, whether that's in our web code base or our server code base or in our shared types or anywhere else, is really efficient for the agent to find, right? Whereas, let's pretend that, alternatively, we would have called our product, like, ChatGPT Code.
Not saying we didn't consider that. Then it would be super hard for the agent to figure out where we wanted to direct it to, and so we'd probably have to provide more relative folder paths. So there's a lot of this stuff, like, as you start to think ahead, like, oh, I'm gonna have an agent that's gonna be using the terminal to grep, then, you know, you can start naming things intentionally.
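To make the greppability point concrete, here is a rough sketch of the difference (the paths and searches are hypothetical, not OpenAI's actual repo):

```bash
# A unique codename gives a clean, low-noise search from anywhere in the repo:
grep -rn "wham" web/ server/ shared/
# a handful of hits, all actually related to the feature

# A generic name buries the agent (and you) in false positives:
grep -rni "chatgpt code" .
# thousands of hits across docs, comments, and unrelated strings,
# so every prompt would need explicit folder paths instead
```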
Would you start naming things less for human readability and more for agent readability? Like, what's the trade-off in your mind? Yeah, it's interesting, because I definitely had different priors coming into OpenAI. I currently believe that the systems are actually very convergent; there's a lot of overlap. Maybe it's because, as long as you have humans and AI both writing it... maybe there's a world where it's only AIs maintaining a code base and the assumptions change.
But the moment you break that fourth wall and a human's coming in, doing code review, deploying the code, it has human fingerprints all over it, right? And so how humans communicate to AI, how humans communicate a bug that needs to be fixed, or communicate business requirements, right? All those things aren't going to go away immediately, and so I think the whole system still reflects that.
I think there's a cooler answer I could give, that it's like, oh no, it's this alien thing, it's completely different. But I don't know, these started off as large language models. There's a lot rooted in humans.
By the way, if there's somewhere you want to take this, you should actually cut us off, because I realize we're just kind of monologuing between each other. No, no, I think this also ties to the agents.md, right? It's like, why is it called agents.md and not readme.md? There's kind of like, I guess, in your mind, some fundamental difference.
with how the agent and the human consume the information. So I'm curious if you think that breaks down at the class-naming level, or just at the instruction level, like, where does it break down? Yeah, okay, so there were a few options for this naming, right, which we considered. So you could go for readme.md, you could go for contributors.md, right, you could go for codexagent.md and then maybe codexcli.md as these two separate files, but those are sort of branded. There's also Cursor rules, Windsurf rules. Yeah, yeah. And then you could go for agents.md, right? And so there are a few trade-offs here. I guess one is openness, and one is specificity, I suppose. And so when we thought about it, you know, we thought about, well...
Probably there's things that you want to tell an agent that you don't need to tell a contributor. And similarly, there's things you want to tell contributors to really help them set up in your repo or whatever that you don't need to tell the agent. The agent can just figure that out. So we were like, okay, maybe this is going to be different.
You know, the agent's going to read your readme anyways, so maybe agents.md ends up being the stuff that you need to tell the agent that it's not automatically figuring out. So we kind of made that decision. Then we considered, okay:
There are different form factors of agents. The most special thing about what we are building and shipping is an out-of-the-box way to use a cloud-based agent that can do many tasks in parallel and can think for a long time and can use a lot of tools safely.
We thought, like, well, you know, how fundamentally different is a set of instructions that you want to give that from an agent that you're working with more collaboratively on your computer? We had a good amount of debates about that, to be completely honest.
And then we ended up concluding that actually those sets of instructions aren't different enough that we need to namespace this file. If there is something you need to namespace, you could probably just say it in plain language within the file. Then the last thing we considered is, well, okay, how different do we think the instructions you have to give our agent are from the instructions you might give to an agent running on a different model or built by a different company?
And we just think it kind of sucks if you have to create all these different files for different agents and whatever. Part of why we made the Codex CLI open source is that there are a lot of problems, like safety issues, that you need to figure out for how to deploy these things safely, and no one should have to figure those out more than once. So that's why we went for a non-branded name. And I have one specific example of how readme and agents.md differ. For agents, I don't think you really have to tell it the code style. It looks at your code base and just writes code to match, whereas a human's not going to take the, sorry, their time to go through the code base and follow all the conventions. So that's just one example. At the end of the day, there are differences between how developers approach it.
Cool. I think that's a really good set of advice. I think you just gave us our episode title; we're just going to call it best practices for using ChatGPT Codex. And, you know, I think people are going to want best practices. So I noticed something that's very interesting, right? I think there are always two approaches in terms of building agents. One is you try to be more controlling. You're trying to make it more deterministic. And then the other, you try to just prompt it and trust the model. And I think your approach is very much prompt it and trust the model. I see inside the agents.md and the system prompt that you just prompt it to behave the way that you want, and you just expect the model to behave that way. Obviously, you have control of the model, so you can train it if it doesn't do well. But one thing that makes me question it is...
How do you fit everything in context? What if I just have a super long agents.md? In your live stream, you had it demoing on the OpenAI monorepo, which is just giant. How do you manage caching and context windows and all that? Yeah, I mean, would you believe me if I told you that right now it all just fits? Not the OpenAI repo. No, I'm sorry, everything that the agent needs. Right. So you just take the agents.md and put it at the top, right? It's just like another system prompt.
No, actually, it's a file that the agent knows how to, like, grep and search for, right? Because there might be multiple ones. And so you can actually see it in the work log, right? It very aggressively looks for an agents.md; that's how it's been trained. I'll say it's been really interesting joining OpenAI and seeing how, when you're thinking about where models are going and what AI products will look like years from now, you design a product differently.
Right, like before OpenAI, especially when you don't have access to a team of researchers and many GPUs, you're building these deterministic programs, right? A lot of scaffolding around how this operates. But you don't really let the model operate at its full capacity, right? It was interesting when I just joined, actually, I got a lot of pushback when I said, hey, why don't we just hard-code this? Like, listen, you keep using this tool wrong. Let's just say in our prompt, don't do it.
And then the researchers will be like, no, no, no, we don't do that. We're going to do it the right way. We're going to teach the model. why this is the right way to do it. And I think that's related to this overall thought of where do you put the deterministic guardrails in and where do you really let the model think, right? Similar conversation around planning. Should we just have an explicit planning stage where it's like, think out loud first, write down what you're going to do, and then go?
Sure, but what if the task is really easy? Do you really want it to think through all of that? What if it needs to replan as it goes? Do you have all these if-else conditions, heuristics to do that? Or do you train a really good model that knows how to switch between those modes of things?
And so it's tough. Like, I definitely have advocated for little guardrails here and there until the next training run is done, but I think we're really building for this future where the model is able to make all these decisions. What's really important is that you give it the right tools. You give it ways to manage context, manage memory, manage ways to explore the code base. Those still are really important.
Yeah, that's super well said. I think building here is super fun and different, and the model isn't all of the product, but the model is the product, right? And you kind of need to have this humility in terms of thinking about, okay, well, what are the things that...
There's like three parties, right? There's the user, the developer, and the model, maybe, right? What are the things that the user just needs to decide up front? And then what are the things that we, the developer, are going to be able to decide better than the model? And then what are the things that the model can just decide better?
And every decision just has to be one of those three. It's not like everything is the model. For instance, we have two buttons in the UI right now, like ask and code. You know, those probably could get inlined into the decisions the model makes. But, you know, right now, it was just really, like, it made sense to kind of just give the user choice up front because we spawn a different container for the model.
first, based on what button you press. So if you ask for code, we put all the dependencies in. I'm going to oversimplify here. But if you don't ask for code, if you're just asking a question, we do a much quicker container setup before the model even gets going.
And so, you know, that's maybe a user decision. There's some places where, you know, user and developer decisions kind of come together around the environment. But ultimately, a lot of the agents that I see are really impressive, but part of what's impressive is that it's a bunch of developers building this really bespoke state machine around a bunch of short model calls.
And so then the upper bound of complexity of problem that the model can tackle is actually just what can fit in the developer's brain. And over time, we want these models to solve much more complex problems just by themselves, on more and more complex individual tasks. And then eventually you could really imagine that you get a team of agents working together, maybe with one agent that's kind of managing those agents.
And, you know, the complexity just explodes, and so we really want to get as much of that complexity, as much of that state machine as possible, pushed into the model. And so you end up with these kind of two modes of building: in one place you're building product UI and rules, and in the other case...
You still have to do work to get the model to learn something, but what you have to do is figure out what are the right things that this model needs to see during its training to learn it. And so it's still a lot of human work to figure out how to get that change, but it's a very different way of thinking: we're going to get the model to see the right things.
But how do you build the product to get the signal? So if you think about Code and Ask, you're basically getting the user to label the prompt in a way, right? Maybe they say Ask, this is an ask prompt; Code, this is a code prompt. Are there any other kind of fun product designs, as you built this, of like, okay, we think the model can learn this but we don't have the data, so this is how we architected Codex to help us collect that data?
I think file context and scoping, which we don't have great built-in things for right now, is another example of this, right? We're often pleasantly surprised, like, oh, it was able to find the exact file that I was thinking about, but it takes some time, right? And so a lot of times you'll shortcut a bunch of chain of thought by just...
Hey, I'm looking at this directory. Can you go? So I think that'll probably be there for a bit until, you know, you have some better architectural indexing and search. Yeah, I'll add to this. I'm actually going to double down on my point about how we think about it. So, you know, one thing we might consider is context window management, right? And, like, should we intervene here?
And so we can do a product intervention, write some code to intervene. And then kind of the next level of thinking, maybe a little bit more AGI-pilled, is: okay, let's get the model to see context window management stuff in its training. I can't even come up with an example at this point because I'm too AGI-pilled. I feel like, I don't know, we could come up with something that it has to see to learn how to manage its context.
But those are specifically tasks related to the context window. But then the most AGI-pilled thing to do is to be like, we don't actually need to think about this problem. The model will just figure it out. All we have to do is give it harder and harder problems. And then it will just have an emergent property of managing its own context, because that's the only way it can solve these problems. Right, so I'm slightly oversimplifying here, but...
You know, basically the model learns to manage its context. And so when you were talking about it working in the monorepo, it learns how to be efficient with the way that it spends its tokens as it's browsing and searching. And, you know, in your example of there being a giant agents.md, I guess we would just show it some versions where there was one, and so it learns it shouldn't read the whole thing every time, and it should first figure out how many lines it has.
So anyway, summarizing: we just need to keep giving it harder and harder problems. And a lot of these things that we might be very tempted to build some intervention for, it will just have to figure out. And if it doesn't figure it out, maybe it didn't matter. Sure. I totally get that. I think we don't really have online models yet, right? And that's kind of what you need for your vision to be real.
For what it's worth, I wasn't thinking about a giant agents.md. I was just thinking about hierarchical agents.md files, a lot of them. I think one issue with this version where the model is the product is that your dev cycle as the Codex team, like the two of you, is not as tight, because you have to be like, okay, every time there's a bug, all right, now I need to go get data.
And where do you get the data? I don't know. Maybe employees use it. Maybe you buy it from vendors and you hire some human raters or whatever. And then you have to train it in, and then you have to go test it again. It's very slow, isn't it? Yeah, I think it's definitely, from a building perspective, you have to do this when you're really willing to play the long game of: we're going to build a better model, maybe even a model bespoke for a certain functional purpose, like codex-1, and then we're going to generalize the learnings from that model into an even bigger model that's getting all these other learnings from other functional purposes. And these together will become a really powerful thing. And that's kind of the philosophy we have with training models so far. And it has been working, but it's definitely a long-term plan.
You know, another example, we do this on occasion: recently we released GPT-4.1, a really good coding model. And again, that was based on us saying, hey, we want to invest better in this area. Let's hang out with a bunch of developers, understand their feedback, how things work, create some evals. And, you know, like you said, it's a lot of work to do that.
But then we end up with a great model and even more exciting, we can then take those learnings and put them into our mainline models and then everything benefits. And you kind of, the sort of philosophical view, I don't know if I can factually prove it or not, maybe someone here can, is that like,
If you can do something very specific for a specific purpose, actually when you bring that and you bring it into the generalized model, you might even get outsized returns on that because there's transfer from all these different domains. Okay, cool. I think we had a couple factual things to wrap up on just Codex itself, and then we wanted to double click on the compute platform stuff, which I think Josh, you wanted to cover more on. So I noticed in the...
It was between 1 and 30 minutes in length. Is that a hard cutoff? Have you had it go for longer? Any comment on that? Yeah, I mean, I just checked the code base before this because someone else had this question. Our hard cutoff is an hour right now, although don't hold us to that; it may change over time. The longest I've seen is two hours, in development mode. So, you know, I think 30 minutes... these are hard tasks that require a lot of iteration.
Yeah, I mean, I think our average is actually significantly lower than 30. Yeah. But if you give it a hard task, you'll end up at 30. Yeah, I mean, you know, I think there's a couple analogies here. One, I think the Operator team has a benchmark where they had to cut off at two hours.
And then the other one is the METR paper, which I don't know if it has been circulating, where they estimated that the current average autonomous task time is like an hour, and it's maybe doubling every seven months. So an hour sounds right, but also, I mean, that's the median. So there's going to be some that go longer than that. Yeah, totally.
Is this part of it? You had cutoffs for a few, like the 23 SWE-bench Verified examples that were not runnable. Was that part of it in terms of length, or was there just something else? Yeah, to be honest, I'm not exactly sure, but I feel like there's a bunch of SWE-bench cases that actually are, like, invalid might be too strong of a word, not sure, but I feel like there are issues with running them, so they just don't work.
Okay. And then max concurrency, is there a concurrency limit? If I have like 5, 10, 100? 5 and 10 is totally fine. I feel like we did introduce a limit for fraud reasons. I don't know what it is. 60 an hour. Wow, so one per minute. I'm just gonna... Yeah, but like, this is literally the point, right?
So, like, long term, we actually don't want you to have to think about whether you're delegating or pairing with AI. Like, if you imagine an AGI super assistant, you just talk to it. And it just does stuff; it answers quickly if it needs to, or, you know, takes a long time if it needs to. And you also don't have to only talk to it; it's also just present in your tools, right?
That's the long-term thing, but in the near term, yeah, this is a tool you delegate to. And the way we see people using it, going back to, I guess, maybe the title of this podcast of best practices, is that you must have an abundance mindset, and you must think of it as not using your time to explore things. And so, you know, often when a model is going to work on your computer, you'll really craft the prompt, because then it's going to use your computer for a while. But the way we see the people who love Codex the most using it is they think for maybe 30 seconds max about their prompt.
It's just like, oh, I have this idea, boom. Oh, there's this thing I want to do, boom. Oh, I just saw this bug or this customer feedback thing, and you just send it off. And so, yeah, the more you're running in parallel, I think the happier we are, and the happier we think users are when they see it. That's just the vibe of the product.
Yeah, I'll add my own anecdote. So I was on the trusted testers team for this thing, as both of you well know. And I found out I was using it wrong. I was using it like Cursor. Like, I had my chat window open and I watched it code. Yeah. And then I realized I wasn't supposed to. And I was like, oh, you guys are just firing the things off and, you know, going on about your day. And yeah, that was a change of mindset.
Yeah, one thing that's quite fun is to use it on your phone. Because somehow just being on your phone flips the way people think about things. So we made the website responsive, and we'll pull it into the app eventually. So try it. It's actually super fun and satisfying. Okay, so yeah, it's not... Yeah, because there was one of the videos that was showing the mobile engineer coding with it on his phone, but it's not available in the ChatGPT app.
Yeah, not yet. Just one question I got from the mobile side: the notification I got when it starts the task says "starting research", the same way the deep research notification does. Is it using deep research as a tool, or did you just reuse the same notification? We just reused the same notification, yeah. So you mentioned the compute platform, and you mentioned how you share some of the infrastructure with RL. Can you maybe just give people a high level of
what Codex has access to, what it doesn't have access to. It doesn't look like people can run commands themselves; they can only instruct the model to do it. Any other things people should keep in mind? Yeah, so I'll say it's an evolving discussion as we figure out what parts we can give folks access to and what we need to hold back for now, right? And so we're learning, and really, we would like to give humans and agents alike as much access as possible within safety.
What you can do today, right, is, as a human, set up an environment and set up scripts that get run. These scripts typically will be installing dependencies. I expect that to be maybe... 5% of the use case there, and just really getting all the right binaries in place for use. We actually do have a bit of an environment editing experience where, as a human, you can drop into a REPL and try things out. So, you know, please don't abuse it, but there are definitely ways for you to interact with the environment.
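As a rough sketch of what one of those dependency-install setup scripts might look like, here's a hypothetical example for a Node-plus-Python repo; the package manager, paths, and commands are assumptions, not anything the team has published:

```bash
#!/usr/bin/env bash
# Hypothetical setup script for a Codex cloud environment.
# It runs while the container still has network access, so every dependency
# and binary the agent will need is in place before the task starts.
set -euo pipefail

# JavaScript dependencies (assuming a pnpm workspace)
corepack enable
pnpm install --frozen-lockfile

# Python dependencies for the API service (assuming a requirements file)
pip install -r server/requirements.txt

# Anything else the agent will need offline, e.g. generated types
pnpm run codegen
```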
We laugh about that because earlier I mentioned scope creep. We weren't planning on having a REPL to interactively update your environment. You know, anyway, Josh was like, oh man, we need this. And so that was an example of scope creep. Thanks for doing it. We do have limits in place and we do monitor that
carefully, but there are interactive bits there to get that going. But once the agent starts running, what we actually do today, and we're hoping to evolve on this, is we'll cut off internet access. Because we still don't fully understand what letting an agent loose in its own environment is going to do, right? You know, for now the safety tests have all come back very solid, like, you know, it's not susceptible to certain sorts of exfiltration, prompt injection, but there's still a lot of risk in this category, so we don't know, and that's why, to start, we're being more conservative there, and when the agent's running it doesn't have internet access. But, you know, I'd love to be able to change that, right, allow limited access to certain domains or certain repositories. And so, all this to say, it's something we're evolving as we build up.
I'm not sure that quite touches on your original question. The last thing, though, that I do want to mention is that there's an interactivity element here.
As the agent's running, sometimes you're just like, oh, I want to correct it, tell it to go somewhere else. Or, let me maybe fill this part out, and then you can continue. Right, we haven't quite solved those problems either. What we really wanted to start with was to shoot for the fully independent, deliver-massive-value-in-one-shot kind of approach, but yeah, we're definitely thinking about how we can weave humans and agents together better. I mean, for what it's worth,
I think the one-shot thing is a good angle that the others... this is me comparing you to alternatives like Devin, Factory, and all the others; they are more focused on multi-shot human feedback and all that. But, you know, I have a website I'm working on, and I gave it a request and compared it to all the others. That was my test for Codex, and it one-shotted it. I posted the screenshot in a tweet just earlier today.
And I think it's really good, especially if you're running 60 at a time. So I think that really makes sense. But it is a very ambitious goal because human feedback is a crutch that we like to use.
It also, I think, makes us write more tests, which is annoying because I don't like to write tests, but now I have to write tests. Fortunately, I'm now getting Codex to write my own tests. And I really like that on the live stream as well. You can just kind of ask it to just look at your code base and just suggest stuff to do. I don't even have the energy to figure out what I should be doing. Yeah, I had delegated delegation. I thought that was a great line.
And we're not saying one form factor is better than the others, right? Like, you know, I love using the Codex CLI, and, as we talked about in our interview when I was interviewing at OpenAI, you really want both modes. But I think what we see as the role of Codex here is to really push the frontier on single-shot autonomy.
Yeah, I kind of think of the research preview as, like, our thought experiment. It's like, you know, what does running an agent in its purest, most AGI-pilled form look like? How does it scale, fail, perform? And then maybe, I mean, for me personally...
Part of what excites me about working at OpenAI, it's not just like solving for developers, but it's just really thinking about how does AGI benefit all of humanity and what does that feel like to non-developers as well.
And so, for me, what's really interesting is thinking of Codex as an experiment for what it'll feel like to be in other functions, you know, doing work. And the goal for me to build towards is this vision where we do the work that's ambiguous or creative or hard to automate in whatever way, but otherwise we just have agents that we're delegating most of the work to.
But these agents, they're not like this long horizon thing versus short horizon. They're just kind of ubiquitously available with you. So yeah, we decided to take the purest form to start, which we thought would be the smallest scope thing to ship and probably isn't. But yeah, and then we're going to bring these things together.
Okay, I think we have time for a couple questions. I'm just going to double click on the research preview a bit. It is a research preview. What is left? What do you think would qualify it to be a full release? On the live stream, Greg mentioned the seamless transition between cloud and CLI. Is it that or are there other things on your mind?
I mean, to be completely honest, you know, that's part of why we believe so much in iterative deployment. Like, I can give you some of my thoughts now, but also we're really curious to see, because this is such a new form factor. But, you know, some of the items that are top of mind for me are multimodal input. Another example would be just giving it a little bit more access to the world; a lot of folks have requested forms of network access.
You know, I also think that right now kind of the UI that we shipped is actually one that we iterated around. It's like a fun story there. And it's one that people find useful, but it's definitely not the final form of what it is. And we would love for it to be much closer with the tools that developers spend time in. So those are some of the themes we're thinking about. But to be clear, we'll iterate and figure that out.
I wanted to ask: why did you put finding a typo as one of the onboarding things? Because I used it, and then I saw it, and it's literally just grepping for potential typos. It's, like, searching, grepping for "selenium" with an n, or something with "tgn" instead of "ng", but it went through like 50 of these and then finally found "will" misspelled as W-I-L, without the two L's.
But it was really cool to see what it tried, like "default" spelled as d-e-f-u-a-l-t. It just grepped for all these different things and then eventually got there. Like, why did you pick that? Honestly, Tiwo and I were talking about it, and he was like, listen, it would be funny if I had a typo as I typed this prompt out, just to make it a little bit meta. You know, nervous fingers on a live stream.
Maybe optimize for that. I have noticed it likes to grep for "teh" to look for "the", and so it works. That was great. Any parting thoughts, call to action? Are you growing the team? Do you want specific feedback from the community? I think for me, the one thing that is really on my mind to get better at over the next few months is really helping you customize
in a more high-fidelity manner. It turns out, the good news is, the agent can do a lot of really good work with only the basics, right? It's much like if your dev machine is borked and you're sort of looking at your editor but none of the type checks are working: a lot of folks can still actually do a lot of good work. But how do you close that last, you know, 30, 40 percent?
It's really hard because there's such a wide variety of environments out there, but I'd especially love feedback from folks on how they would like to see their environments set up. Do they want to ship us a Docker image? Would they rather have us support dev containers? So the form factor of the DX of how you do environment customization is still very much an open question that we need input on.
Yeah, big plus one to that. And I think for me, maybe the thing that I'm most interested in is: hey, there's a new shape of tool to collaborate with. And I'm just really interested for people to try working with it in as many different ways as possible and kind of figure out
where does it work well in your workflow? Like you said, you know, you mentioned earlier, you were trying to use it kind of like your IDE and then you realized it was different, right? So I would just love for people to take advantage, especially now, like we're very intentionally just providing like very generous rate limits so people can try it. We just want you to try it and figure out what's next.
What works? What doesn't? How do you prompt it? And then we want to learn from that and use that to lean in. Yeah, I guess my parting call to action is: please go try it out in ChatGPT. Use it as much as you can, especially now. And then let us know how you like to hold it, basically. Yeah, I'm worried about the pricing when it happens, but I'm going to abuse this. Why not? Why not? Yeah, send us feedback on pricing too. Yeah. Okay. It's too early to talk about pricing, right?
Yeah, it's too early now. But yeah, based on Claude Code, that's a thing that people are worried about, right? And Claude has started to introduce some mix of fixed pricing and variable pricing, and I think it's a huge mess. There's no right answer. Everyone just wants the cheapest coding attention they can get. Good luck. Thanks. I mean, my take is, I don't know what's going to make it in. But like...
We aim to deliver a lot of value, right? And it's on us to show that and really make people realize, wow, this is doing very economically valuable work. And I think a lot of the pricing can fall out of that. But I think that's where the conversation should start. Yeah, awesome. All right, well, thank you so much. Thanks for working on this and thanks for sharing your time. It's been a long time coming, but I think people can start seeing that
OpenAI in general is getting very serious about agents. It's not just coding, but coding obviously is the one loop that is self-accelerating that I think obviously you guys are super passionate about. It's really inspiring to see. Yeah, super excited to just, like, ship everyone this coding agent and then, yeah, like, bring it together into just, like, the general AGI super assistant. Yeah, so thanks for having us on. Thank you, guys. Cool. Thank you.