¶ Intro / Opening
A year from now, where do you think the platform will be?
We'd want to experiment with directions where Claude actually gets so good that, in the end, it handles this itself. It figures out what model you should be using. It figures out how to spin up all the subagents. You don't have to think much about what kinds of architectures are out there, because Claude is actually able to understand itself well enough that it can write itself on the fly.
In that world, if your agents are becoming, on the fly, what they need to become in order for you to do what you're trying to do, the platform has to seriously scale.
How close are we to "Claude, make me a billion dollars"? That's really what I'm asking.
Angela, Caitlin, welcome to the show.
Thanks for having us.
So for people who don't know, you both work on the platform at Anthropic. Angela, you're the head of product for the Claude platform, and Caitlin, you're the head of engineering for the Claude platform. I'm really psyched to talk to you because, A, you've been launching a bunch of stuff. You have Claude Managed Agents, which came out recently, and you've been launching new features for it.
And I think it comes at this really interesting time, where it makes me think about what a platform in AI actually is for a model company. Because in the GPT-3 days, the platform was a completion endpoint. You'd just send a prompt and get a response.
After that it was a completion endpoint with tool calling and chat sessions, that kind of stuff. And now, with Claude Managed Agents, you're essentially getting a Claude on a computer with memory and all this other stuff. So I'd love for you to help me unpack that trajectory and what it means to build a platform in AI.
Yeah. I think your characterization is very accurate. As a lot of these technologies have evolved, with the LLM first starting and then, I think, putting that
¶ How the Claude platform evolved from API to agents
fun. A lot of people were like, "Wow, I could do some..." I think it was very cool. Now we'll probably look back at it ... And then I think we've moved more and more towards a slightly more stateful world, as you want to persist the session state to make sure that the performance of the model is better and better. I think that's probably actually the through line. As we make improvements to Claude, and as it continues to get better and more autonomous, we find ourselves basically needing to evolve the platform to be a higher- and higher-order abstraction, but it's in the pursuit of helping you get the best outcomes out of something.
I think in the very beginning everyone was very exploratory. You'd have no idea what people were going to build with these LLMs, and you wanted to have as much possibility out there as available. And then those use cases started to narrow down, and people are now building agents with it. More and more of that is about customers coming to us and asking: How do I get the best out of Claude? How do I set up my tools? How do I run the loop? And so on and so forth. You have some people who are really, really experimenting and they're on the edges, and that's great. And then you have a whole host of other folks coming in who are like, "I kind of want a lot of this stuff out of the box."
And in our pursuit of making sure that Claude is producing the best outcomes, we find ourselves enriching the platform to be richer and richer. Contained in that is the state, it's the tools that you start to see us adding, and it contains a lot of the cloud components of these types of things. But it's in pursuit of the same mission of just making things literally as easy as possible. And I think in the forward state, maybe the philosophy of what a platform ultimately ends up doing is that it probably ends up being the set of primitives and infrastructure that enables you to get the outcome as fast as possible, with actually as little ...
And I think that tends to follow a certain form factor, at least in this current state. But yeah.
¶ The primitives that make up Claude Managed Agents
How would you characterize what the primitives are today? Maybe that's just asking, what are the primitives in Claude Managed Agents?
Yeah, so Claude Managed Agents is built on all of the same primitives that you could otherwise build on directly, so the Messages API. And within the Messages API we've built a whole bunch of, I guess, innovations around the API. You could just get tokens in and out if you really wanted to, but you can also use some of our built-in tools. You can use stuff like code execution, spawn a sandbox, and execute work. You can use web search and all these sorts of different things. And so I think we've taken what we see as the most powerful of those things and put them together into a harness and a set of infrastructure that is just the way to get what we think are the best outcomes out of Claude.
So I'm sitting here feeling this sense of, I've been thinking of it as time deflation. My time gets more valuable in the future, as opposed to the opposite, whatever the opposite would be: my time gets less valuable in the future.
And the reason is, for example, internally we're building some agent products, agents that do specific things for us internally and then hopefully for customers. And in order to do that, we have a couple of Mac minis with Claude running in a loop on the Mac mini, right? And it's like a thousand-line Python file or whatever.
A lot of that mirrors what you guys are building in Claude Managed Agents. So for me, and I think for a lot of people building on Claude or on the Claude platform or ecosystem, at least I feel this: maybe we should just wait for you guys to build it. But then I don't know what the lines are. I'm sort of wondering, if I want to build an agent, what is the best path to do that in a way that aligns with what you guys are
Yeah. I think this part of the platform business is actually somewhat similar to any other form of platform business, where you do have customers like yourself who are building, and you're thinking, should I go ahead and do it, because maybe I have this immediate need? But at the same time I don't want to repeat the work when I could have just gotten it for free out of the platform. And also, infrastructure sucks. It sucks so much to spin up servers. And I can't believe you do that all the time. That part, everyone's like, that's cool.
But I will actually say, part of why we ended up building Claude Managed Agents was that Anthropic ourselves had gone through enough of these iterations, where we built products that were agents that you could run autonomously in the cloud, and we did that stand-up-the-infrastructure-so-that-it-works-well sort of work enough times that we ourselves were like, okay, we're done building this for ourselves. We're doing it once, in a way that's going to really work, from everything that we've learned, but also for all the people who are doing it.
You can run whatever you're running on a couple of Mac minis, maybe, right? And for a lot of people that could work. But I think if you're building agents into your product and you're running something really at scale, that's where it really starts to become more and more challenging to get that infrastructure right. Yeah.
One is a bit in the way that we design Managed Agents, which is that we try to have it be modular enough. We want to be opinionated about some pieces that we feel should be very well married to the Claude model. For example, we want Claude to very specifically use file systems. That's a very particular Claude kind of thing: just file systems in general. We also really want to lean into skills. I know a lot of folks like skills, but that's something we want our harness to be really opinionated about. And so we're kind of particular about those primitives being the case. So: use the file systems, use the skills. They're really basic. But at the same time, we still find people who are trying other methodologies to go do that. And we want to help you, when you build, to start just kind of ...
So that's one piece, on some of the more opinionated ones. But for each one of these endpoints or APIs that we have as part of the suite, we try to open them up a little bit in certain areas. So there are things that we're looking forward to, where maybe it's not available today, but in our design we are trying to make it flexible enough for people to add in different pieces, because we recognize that this API, or suite of APIs, is not necessarily going to solve everything in its original construct, and they're going to open up.
And then the second bit, and we're kind of public about this, is that when we design a lot of these things, we put out blog posts and reference implementations. So if you did want
I think, to the point you just made, that's something that's coming up for us. Again, we have Claudes running on a Mac mini with a Python file, and a couple of other bigger, more serious implementations on cloud infrastructure that we're trying to figure out what to do with.
And I told the team that we were talking today, and one of the feelings of consternation that they have about using Claude Managed Agents for this kind of thing, for spinning up agents for our customers, is this: right now we have a playground. We just have a server or a Mac mini. We can just pipe stuff to Claude, and it can do anything that Claude Code can do. It has a file system, it has a browser, it has all this stuff. If we want to switch it out to GPT-5.5 or Gemini or whatever, it's pretty easy to do that. So I feel like they feel that if we use a Claude Managed Agent, we're going to get locked in, and we're not going to have the flexibility to do all the stuff that we want.
And there's also a worry that features are going to come to Claude Code itself that won't be in Claude Managed Agents for a little while, and that it'll prevent us from being at the edge, which is sort of what we promised to our customers and really to ourselves. We just love doing whatever the new thing is. How do you think about that?
¶ Why the harness and the model are becoming a single unit
Yeah, so I think what's nice about the way that we work internally is this: we run the platform, and what most people think of as the platform is our externally facing APIs, our suite of APIs.
The rest of what our team actually does is internal platform, in the sense that all of our first-party products are built directly on the same platform as everybody else. And what's cool about that is we spend, not all of our time, but a lot of our time working with the teams internally who are building on top of the platform, enabling the features that they will build, sharing ideas, and these sorts of things. And so I think over time you'll maybe see less and less divergence between what might be available in Claude Managed Agents and what might be available in Cowork or Claude Code, which might sit on top of the same infrastructure, right? That's, I think, one way to think about that.
Yeah. And then, on your team's point around having some kind of model lock-in fear, I think that's valid. Many folks have that consternation. And I think we're at this place where there's a bit of an evolution. If you look back maybe even just a couple of months ago, it was very standard to build a very, very generic harness. It's super generic, and then you can hot-swap models across all of those things. And I think for an older generation of models across labs, that worked okay. A lot of things were moving at a pace where that was mildly reasonable. I think now, for the next generation of models and as you see it going forward, you see this a little bit from every lab. Everyone's taking slightly different techniques and perspectives on how they want to advance their particular form of the model. And so in theory, I guess, you could do the superset of all those things. But more often than not, when you build agents for your company or for your customers, you do want to deliver an outcome for them ultimately.
And so I think that level of abstraction, of what you're actually hot-swapping, stops being this really generic harness with the model hot-swapped underneath, and it gets more to where the harness and the model are very paired. You still need redundancy, and you still might want to use different models for things, but you probably do it at the layer of the agent, meaning the harness plus the model, rather than necessarily the other architecture of a really, really generic harness with models swapped underneath.
That's really interesting. Is that how, I don't know, the Cursors of the world are doing things? Do they have a separate harness for each model, or is it a generic harness that they're hot-swapping the models in and out of, do you know?
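The idea of swapping at the agent layer, rather than swapping models under one generic harness, can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Agent` class, the model identifiers, and both harness functions are hypothetical stand-ins and do not correspond to any Anthropic API.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Agent:
    """A harness and the model it was tuned for, treated as one swappable unit."""
    name: str
    model: str                           # hypothetical model identifier
    harness: Callable[[str, str], str]   # (model, task) -> result


def run_with_fallback(agents: Sequence[Agent], task: str) -> Tuple[str, str]:
    """Try each agent in order; return (agent name, result) from the first success."""
    errors: List[str] = []
    for agent in agents:
        try:
            return agent.name, agent.harness(agent.model, task)
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(f"{agent.name}: {exc}")
    raise RuntimeError("all agents failed: " + "; ".join(errors))


# Two stand-in harnesses. In practice each would carry its own prompts, tools,
# and context management tuned for its model; here they are trivial stubs.
def file_system_harness(model: str, task: str) -> str:
    raise TimeoutError("provider unavailable")  # simulate an outage


def generic_harness(model: str, task: str) -> str:
    return f"[{model}] done: {task}"


primary = Agent("primary", "model-a", file_system_harness)
backup = Agent("backup", "model-b", generic_harness)

winner, result = run_with_fallback([primary, backup], "summarize the release notes")
print(winner, result)
```

The redundancy lives in the list of agents, so each harness can stay specialized to its own model instead of being reduced to a lowest common denominator.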
I'm not entirely sure. I don't know about Cursor in particular, but my intuition is that there have been teams we've talked to who have fallen on similar perspectives. And it's mostly because they're just trying to squeeze the most out of each model, to almost harness-engineer every single nuance.
And one example that we have, it's not an external customer per se, but something that we've done a lot internally. We recently launched memory, for example, with Managed Agents. And we tried a bunch of different harnesses ourselves. We tried one that was the one that we ended up launching. We tried a bunch of others using a bunch of other techniques. And at least personally, when I saw the eval suite from the team, each one of these harnesses performed drastically differently.
And so I think just looking at something like that shows you that you can actually hill-climb them together. And I think if you were to take that forward across all model combinations, across all different labs, all different kinds of providers, there is a lot of alpha in that kind of construct, and so I wouldn't be surprised if more teams have experimented with that level of ...
It's really interesting that there's this path dependence, where you make some choice for how you do requests and responses, or how you do tool calls, or whether you have the model want to use file systems or not. And then that sort of changes the path of all of these different models.
And it feels like, maybe at the time, such a small thing, almost like a footnote. But it ends up becoming very ... Do you think that will end up affecting the model's generalizability, in the sense that at some point they'll just have these locked-in lanes of stuff that they're good at? Claude is really good at file systems, and GPT is good at some other things. How is that going to flow through the model's personality and behavior if it's locked into a specific way of doing ...
I do think it does actually tend to lock the model. So what we end up treating as the right paths and the right primitives needs to be very thought through. I think in some eras of other models, they become really, really good at reasoning, and then they almost over-optimize on that level of reasoning. And there are other perspectives: okay, yes, we want it to be really good at, like, a computer. Maybe the computer part is the ...
And so if you think through some of the primitives, which we could get right or get wrong, at least we'll go through the thought process, and that will probably lead us down one path or the other. I think it's hard to say what, per se, will ultimately be true, but I do think there's a lot of path dependency that it ends up taking. So being really thoughtful about what you choose to actually include for the model more natively is really important.
Are there any of those path dependencies that you've had to undo?
Probably. I can't speak enough about that at the Anthropic level ... like a couple of months, but I have to imagine that has been the case. I mean, we've experimented, even at other labs, the kind of ... constantly changing. And you do kind of hit a little local maximum and re-approach the idea.
Yeah. Interesting. I want to take a step back and ask you something that maybe I should have asked at the beginning, which is: who is Claude Managed Agents for?
Right. I set one up earlier today. We've got some people already using it in production inside of Every, and I just did one today. I really loved the getting-started chat experience that you had, and some of the examples that you had, and it felt to me like, even if I was not technical, I might want to use this to set up an agent. It might be a little bit complicated, but what I actually did, and I'm sorry to say this, is I did it in the Codex in-app browser. So I had Codex driving the Managed Agents setup, and I had a Slack bot working pretty quickly. It was really cool. So when you're designing Claude Managed Agents, how do you think about who it's for?
Yeah, so it's interesting, because I think you're right, especially with that quick start experience, which we actually felt pretty strongly about launching. Not specifically for the sake of making it so that non-technical people could go and build agents, but actually just for anybody, technical or not, to be able to wrap their head around the primitives, like the API, and how it all fits.
Yeah, exactly. The education portion of it. But I think when we think about who it's for, we think about a couple of different things. One is we're seeing people internally within companies build automation, or build really powerful platforms or systems. We've seen people say, "I want a full end-to-end software development platform," right? And Managed Agents is a perfect solution for something like that. Or, "I want to automate a little process over here where legal has to review my marketing copy," right? And things like that. And so you shouldn't have to re-implement memory and all that stuff every time you're doing that, right? You can get started really quickly and you can get something running quickly.
The other user that's top of mind for us is people building into their products that they expose to their customers. And that's the other one where, actually, yes, you do still want a lot of customization. You do still want to make something that's going to be really powerful for your product. But we still definitely, definitely believe that not spending your engineering resources on the infrastructure and on all the little harness-engineering tweaking sort of stuff is like ...
We've talked like a month ago. But I am sort of curious. Okay, so maybe infrastructure is one of these things, but when you see people setting up agents, what do you see them think the hard thing is, and what ends up actually being the hard thing, and are they the
¶ The infrastructure wall that kills most agent projects in production
Good question. Maybe this is, I don't know, spicy, I'm not sure, but I think people think the harness engineering part is the hard part. And so in the past we launched the Agent SDK, which is what you guys, I think, are using on your Mac minis. And for a lot of people, they were like, okay, great, I don't have to do the harness engineering part, where I have to do prompt caching and I have to maximize my context window and all these sorts of things.
I think we're actually just using Claude in Bash, like `claude -p`.
Oh wow. Okay.
It's pretty good.
Yes. Cool. Okay, cool. But regardless, you guys did that because it takes building the harness off your hands, right?
But I do think what we saw with a lot of customers was, okay, now I want to go and take that thing and get it into production and scale it, and everybody hits an infrastructure wall. Everyone hits the same problem of, oh wow, I either need to keep a server constantly running, or I need to use infrastructure that will spin up and spin down, and I need to store the transcript data, and I need secure sandboxing, and all these sorts of things.
And if you boot a Claude Code session, or you boot the Agent SDK in a sandbox, and that's the thing that you have running, but your sandbox loses connection and dies or whatever, your whole agent dies, right? And so I think the infrastructure part especially is the wall that most people end up hitting, but they're expecting that the actual harness engineering, getting the most out of the model, is the part that's going to be harder.
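One piece of that infrastructure wall, keeping the transcript outside the sandbox so an agent can pick up where it left off after the sandbox dies, can be illustrated with a small sketch. It assumes a simple append-only JSONL file; the `SessionLog` class and `run_steps` function are hypothetical names for this example, and a production system would use durable remote storage rather than a local temp directory.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List, Optional


class SessionLog:
    """Append-only JSONL transcript, stored outside the sandbox that runs the agent."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def append(self, event: Dict[str, str]) -> None:
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")

    def load(self) -> List[Dict[str, str]]:
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]


def run_steps(log: SessionLog, steps: List[str], crash_after: Optional[int] = None) -> None:
    """Run the steps not yet recorded in the log; optionally simulate a sandbox crash."""
    done = {event["step"] for event in log.load()}
    executed = 0
    for step in steps:
        if step in done:
            continue  # already completed by an earlier sandbox
        if crash_after is not None and executed == crash_after:
            raise ConnectionError("sandbox lost connection")
        log.append({"step": step, "status": "ok"})
        executed += 1


steps = ["clone repo", "run tests", "open pull request"]
log = SessionLog(Path(tempfile.mkdtemp()) / "session.jsonl")

try:
    run_steps(log, steps, crash_after=1)  # the first sandbox dies after one step
except ConnectionError:
    pass

run_steps(log, steps)  # a fresh sandbox resumes from the stored transcript
completed = [event["step"] for event in log.load()]
print(completed)
```

Because the log lives outside the process doing the work, losing the sandbox loses only the step in flight, not the whole agent.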
Yeah, I totally agree with that. I was just going to say, we talk to so many people who are now at a place where they're prototyping really quickly, and they're super excited, and it's doing the thing. And then there's a class of people who are really pushing and being like, okay, I do want to hill-climb. I really want to edit the harness. But then once you have that thing, productionizing is just a freaking nightmare, especially for the more interesting, long-running async ones that run remotely and are a bit more autonomous. And everyone kind of runs into that wall, and that was a big inspiration for why we ...
I feel like one of the ur-examples of the shape of an agent is OpenClaw. And in particular, the thing that it has brought to us internally is that you have an always-on agent in Slack that has its own personality and its own part of the world that it ends up working on. Is that a possible future: a one-click agent that lives in my Slack, where yes, I can go set up all the internals, but I don't have to really think about all of the technical infrastructure stuff? Because I think you all have the beginnings of that, but it's still a lot of steps from the current managed agent to something that's always on in my Slack, which I have to set up and customize. So does that fall in the realm of the platform's job, or is it too far in the product direction?
No, it definitely is something that we really want to do. I think we focused a lot on the infrastructure piece to start, because that's where we just see a lot of ... But yes, I don't want to say exactly final shape, but in its advanced shape, we actually want to make it so that you can deploy these agents really, really easily. We've made some light steps in this direction. For example, we included vaults as one of the primitives.
Vaults store your keys and stuff, like your OAuth keys?
Credentials, yeah. As a way of solving some of the lower-level pieces as a starting point. But once you wrap some of these more agent-identity types of primitives in a more secure way, and you can handle it really easily in the system, then I think it's very natural for us to get to a place where maybe you are either one-clicking a Slack integration, or alternatively even just telling Claude, "Add Slack," and it just handles absolutely everything. Before you know it, you've got a little buddy.
I love it. I can't wait for that world. What are the best internal use cases of agents? Because I think there's this big question happening right now where, okay, yeah, everyone's in Codex or Claude Code, but now we have these agents that are out in the cloud. Now everyone inside of a company can have their own agent. There are team agents, there are company-wide agents. So what are the patterns that you see when people make really useful internal agents, what they do and what they look for?
Yeah, I would say we've actually seen a few examples of these in some of the more AI-pilled, AGI-pilled companies. Stripe built Minions, and they talked about that a lot as their kind of end-to-end development platform that their engineers could use. I think Ramp did something similar, and we've done similar things as well, right? We've built platforms internally where I have agents running that I can talk to from Slack or from wherever, right? And at a certain point that actually becomes a pretty thin layer on top of Managed Agents. You don't have to do very much to accomplish that.
That's what I was thinking. I looked at Minions or whatever Ramp does, and I was like, why? You know? So is it actually useful to have a sort of thin coding agent that anyone in the company can use? Or why not just install the Claude app?
Yeah, I would say the difference in a platform like that, and in some of the things that we've done internally, is that there's a lot of customization you might want to do on the development environment where an agent is actually running and able to verify its changes, right? And things like that. And so I think for lots and lots of people, Claude Code is an excellent tool, right? And you can run cloud agents with Claude Code, and that is really great. But I think if you're trying to do a little bit more end-to-end development, and you maybe want to bake in more custom things, then you could start with something like Managed Agents and build a layer on top of that, and end up with something that's maybe closer to that end-to-end experience.
¶ Why team agents need a different shape than individual productivity tools
It also seems to me like there's something in particular about having a team that you need to work with that makes the managed agent shape important, as opposed to it all just working in Claude Code. I guess technically you could sync the skills between everyone's Claude Code, but there's something about: we all have one agent that does this thing.
Yeah, I'm really glad you brought that one up, because I think that's actually one of the more common areas where we see a lot of the opportunity. To your point, there's a lot of individual activity happening. Whether you're a developer or a non-developer, there are so many tools that you're using to just be more automated, more high-leverage. But when you get to the team layer, suddenly everything gets massively more complex. Number one, obviously, it can't run on your laptop. And yes, you could maybe put it in the cloud, but that's again more for yourself, to handle things with your laptop closed. But then you go to, okay, now the three of us want a couple of agents that interface and work with each other, and then maybe we're automating a process end to end. Especially for some of the more complex processes that you envision being really transformed with AI, you do need that kind of team orientation. And that needs to happen at a layer that's a slightly higher bit of abstraction than just a single agent.
And I think some of the teams exploring multi-agent setups or the like are really exciting, but it needs to be built on top of a little bit of a platform that everyone can spin up and down. And I think Guillermo from Vercel had a really good perspective on this. His company, Vercel, is obviously incredibly AI-pilled, and he describes it as sort of an AI software factory internally. And I think that's exactly the right mindset, and it produces an extremely high-leverage organization that's really creating a tremendous amount of productivity, not just for themselves, but for every single ...
Mm-hmm. And I really want to go back to this: okay, agent use cases. We've got coding agents that anyone can use in the company. What are the other ones that you see people standing up that are really easy?
¶ How Anthropic's legal team uses an agent to review marketing copy
We've seen a few so one of the fun things that we get to do is just kind of work with our internal teams of different functions and like help them identify because we actually just get to learn a lot as a result of doing that. And so the silly example I brought up earlier of like Legal team needs to review marketing copy was one of the ones that Very real.
Like extremely real and like really blew people's minds with like very basic agents that just give people the right setup to be able to do that. Well what does that actually do? So it's like there there's marketing copy and there's a legal agent that is just like watching what everything marketing does and is like stop like it is more like okay.
I'm a marketer and I've written some copy, right? And n in the past maybe you would have opened a ticket or something and be like, Can you please review this copy? Um but instead you submit it to this like, you know, little app that we built on top of agents that is like, Okay, cool, now I'm gonna go As an agent, review first.
And then it puts it in legal's inbox with a first-pass review already done. And maybe the result is clear enough that the agent can say, Okay, marketing, you're good, right? Or maybe it's still like, No, this needs an extra human review. And that's the sort of thing where, again, it's just a thin layer on top, but you have access, I have access, we can both see the outputs, and we can work together on that.
Okay, but then, for example, why is that not a skill?
It very much can be a skill. You would probably build that agent as a, you know, legal reviewer agent, right? And so you would have MCP servers or whatever it is that help you access external context. You would have skills that help you understand, here's what rules we have to follow and not follow, and all those things. You'd put all those things together, and then you can just fire off a session with that agent. And then I think the last piece you need, and this is where I'm saying it's a really thin layer, is just the form factor on top, where different people can collaborate and work with that agent, and multiple agents can be involved in the system.
And so I think it goes a little bit broader than a skill, because you still need the right form factor for the agent to be able to go run, and then for people to be able to interact.
Another core bit of why it's not a skill, or not exclusively a skill, is that you actually do need a human in the loop. If you were to automate the whole thing, you know, taking the skill and looking at it yourself, like a legal skill, for example, in that world, of course, you could have just done a pure skill. But if you need a human in the loop to be like, okay, I want to review and I do want to check, if we're looking at legal things, then there's a bit of, you know... that's sort of necessary. In order to automate that entire process, you kind of need agents to go do the thing. And because you need to spin up separate sessions for that to happen, some sort of stitching is necessary that can't be instantiated.
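As a rough illustration of the flow described above (agent does a first pass, then either clears the copy or routes it to legal's inbox for a human), here is a minimal Python sketch. Everything in it is hypothetical: `Review`, `LegalInbox`, `submit_copy`, and `toy_agent` are made-up names for illustration, and `toy_agent` is a keyword-matching stand-in for a real agent session, not anything from Anthropic's platform.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Review:
    approved: bool  # True when the first pass found nothing that needs a human
    notes: str      # first-pass findings, visible to both marketing and legal


@dataclass
class LegalInbox:
    # Each item pairs the submitted copy with the agent's first-pass review.
    items: List[Tuple[str, Review]] = field(default_factory=list)

    def add(self, copy: str, review: Review) -> None:
        self.items.append((copy, review))


def submit_copy(copy: str, agent: Callable[[str], Review], inbox: LegalInbox) -> str:
    """Run the agent's first pass, then either clear the copy or escalate it."""
    review = agent(copy)
    if review.approved:
        return "approved"
    # Human in the loop: legal receives the copy together with the first pass.
    inbox.add(copy, review)
    return "needs_human_review"


def toy_agent(copy: str) -> Review:
    """Hypothetical stand-in for an agent session: flags a few risky words."""
    risky = [w for w in ("guaranteed", "best", "risk-free") if w in copy.lower()]
    if risky:
        return Review(False, "Unsubstantiated claims: " + ", ".join(risky))
    return Review(True, "No issues found")
```

The point of the sketch is the thin layer: the agent itself could be swapped for anything, while the inbox and the approve-or-escalate decision are what let several people see and act on the same output.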
That's really interesting. Okay. So just to push on that a little bit, what is the best practice for you? You create an agent whose job is to make sure that when marketing is writing something, they can get it approved really quickly by legal. Sometimes it'll approve things immediately. Sometimes it sends stuff to legal. And ideally, it's getting better all the time, so it can do more and more, right?
What is the best practice for who owns that agent once it's built? Because one of the things that we found is, if you don't have a human who's responsible for the agent, it gets stale very quickly, and then it ends up being kind of this dead thing that's just out there doing stuff, but it's not necessarily good.
And also, even if it kind of works, there are gonna be all these times where legal's like, You asked me to approve this, but I don't really need to approve this thing. Let's update your prompt. So how does that all work when it works well?
So it's actually really interesting, because the form factor thing, right, the app that sits on top of that, that we originally built: one of our teams worked on that, sitting with these teams and understanding what they needed. And they were kinda like, Okay, here you go, we're gonna go do other stuff now, let us know how this goes for you.
And then a really cool thing actually ended up happening, where people on those teams who were using the tool were like, Oh, I wish this little thing could get tweaked, or this thing could get better. And they popped open Claude Code and made some of the changes themselves to the actual... and so it's funny.
Is your team responsible for approving the PR, or does it just go...
Usually my team's responsible for reviewing the PR if it's a system that we actually own. But yeah, people can kind of self-serve making changes to those things, which I think is really cool. I do think we're still in a stage for a lot of teams, a lot of companies. Even going back to, you know, Stripe has Minions, right? Stripe has a large developer productivity team. We used to work at Stripe, so we spend a lot of time with them.
They have a large developer productivity team. They're awesome, and they're obviously putting a lot of work and energy into building platforms and tools like this. And so I think we're definitely still in a place where something like managed agents, or being able to build on top of our platform, is really powerful, but you still kind of need the AI-pilled people and technical people within a business to then go create something really excellent on top of that, that works well for whatever you're trying to do.
That's interesting. I love the "anyone can open a PR to do this," 'cause everyone's using Claude Code. One of the things that I find, talking to people who are in infrastructure roles at companies where this is starting to happen, is, you know the meme where there's a person and he's going like this, and he has daggers in his back and covered... Infrastructure people are that. Now anyone can submit PRs.
How do you deal with that? And how do you do that well? 'Cause obviously, in an ideal world, you would love for legal to be able to submit PRs to improve this agent, and also, sometimes they're probably gonna submit stupid stuff that wastes time. So what are the right ways, either organizationally, culturally, or technically, to make that possible without ruining your life?
For this particular one that we've constructed, that Caitlin's given as an example, we actually have a couple of layers of abstraction away from that PR layer. At the very beginning it kind of started that way. And to basically prevent users from foot-gunning themselves a little bit, they get to a place where oftentimes their way of interacting with the agent that they own, whether it's the marketing team who owns the marketing agent requesting, or if it's the legal... the review, they actually engage with those agents through Claude itself. So they actually spend more of their time talking to Claude, and then Claude will oftentimes figure out what should be the right way for them to go and handle it, so that they're not, you know, hopping straight down to the absolute core bit and doing something that may result in...
And they're talking to Claude or Claude Code? Like Claude chat or Claude Code or Cowork?
An instantiation of Claude that we made, that actually is a managed agent in and of itself. So it's just kind of managed agents all the way down, as a construct. But we found that if we tune and prompt each variant of the managed agent at each layer, it helps to solve different parts of the problem for users. So at the end state, for that marketing person or that legal person, it is a really simple interface, where the way that we tell them is, you're just talking to Claude. But under the hood it's many, many Claudes engaging with each other to get to the part where then the Claudes themselves...
¶ Using multi-agent orchestration for advisor strategies, adversarial pairs, and swarms
Interesting. You guys just launched multi-agent orchestration. What are the coolest things that people are doing with it?
One of the more interesting ones is, I think people are using it to construct different harness techniques. And that one I'm personally very excited by, because there are different techniques that people have experimented with. For example, we recently did the advisor strategy one. But really, if you were to genericize it, you just separate...
And then there's also one where you can have two, you know, modes, where one is generating something and the other one's adversarial to it. And then there could also be one where you split it into a bunch of little tiny pieces and then they kind of recombine. And then there are ones where maybe it's something closer to a best-of-N style of thing. And there are so many more. Each one of these architectures or strategies is good for a very... So some of them are much better for deep research or wide research style use cases, right? And others, the kind where they all sort of swarm together, are better for bug hunting, for example.
And so that's really cool to see: if we can make the primitives very Lego-like, then people can put them together to solve things at a slightly higher form factor, which is more like an architecture or a strategy. And they get much more interesting results out of that. That's really exciting to see, because it also suggests that you can actually hill-climb at multiple layers of abstraction.
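To make the "Lego-like primitives" idea concrete, here is a small Python sketch of three of the strategies mentioned above: best-of-N, a generator paired with an adversarial critic, and split-then-recombine. It is only an illustration under stated assumptions: an "agent" is modeled as a plain function from a task string to an output string, and `best_of_n`, `adversarial_pair`, and `fan_out` are hypothetical helper names, not part of any Anthropic API.

```python
from typing import Callable, List

# Assumption for this sketch: an agent is any function from task text to output text.
Agent = Callable[[str], str]


def best_of_n(candidates: List[Agent], score: Callable[[str], float], task: str) -> str:
    """Run every candidate agent on the same task and keep the top-scoring output."""
    outputs = [agent(task) for agent in candidates]
    return max(outputs, key=score)


def adversarial_pair(generator: Agent, critic: Agent, task: str, max_rounds: int = 3) -> str:
    """One agent generates; the other raises objections until it has none left."""
    draft = generator(task)
    for _ in range(max_rounds):
        objection = critic(draft)
        if not objection:  # an empty string means the critic is satisfied
            break
        draft = generator(f"{task}\nAddress this objection: {objection}")
    return draft


def fan_out(worker: Agent, pieces: List[str], combine: Callable[[List[str]], str]) -> str:
    """Split work into small pieces, run a worker on each, then recombine."""
    return combine([worker(piece) for piece in pieces])
```

Because each helper takes and returns the same simple shapes, they compose: a `fan_out` worker could itself be an `adversarial_pair`, which is one way to read "hill-climbing at multiple layers of abstraction."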
¶ How to measure agent success with outcome and budget as the end state
How do you know if an agent is successful? How do you measure success for an agent?
Yeah, I mean, there's evals and stuff like that, which everyone has talked about ad nauseam. One direction that we really like is this kind of verifiable outcome.
We've been somewhat opinionated on that one. We talked a little bit about what a platform is at the end of things, and going from that philosophy, our principle is that maybe the end state of some of these things is that everything should kind of compress down to an outcome and a budget.
And that's probably about it. Everything else should be figured out for you, to resolve exactly across those parameters. So yes, we still have evals, and we have a lot of other things that we measure that are domain-specific.
For some coding evals, you might want to measure just the actual PR getting merged; those are more verifiable. But as we get to the place where, you know, an outcome is actually a spec that you as a human are able to define, and our ability to interpret that and regrade itself...
Claude, make me a billion dollars. Your budget is ten dollars.
Exactly. I'd say no, mm. Let's go. Exactly.
Maybe Mythos could do that. One of the things that we've been running into, that I'm curious if you have a solution for, is that agents get outdated pretty quickly. Sometimes because there's no human attached to them, sometimes because they're just running an old model or they're on an old architecture or whatever.
And it feels like there needs to be an end-of-life cycle for agents. We've talked about having a little funeral for them, and having a little page on our website that's like, here's all the decommissioned agents and stuff.
How do you manage, especially in a really big company, all the agents that are sort of out there, and maybe they're in Slack pinging stuff once a week, but you're like, This is super stale? How do you make sure that you retire them as quickly as you are making them?
So one of the things we have actually done is we have made skills that help you do things like upgrade to a new model when a new model comes out, right? We've actually put a good amount of work into making it easier to do exactly what you're talking about.
And I think maybe some of the most AGI-pilled people are running agents that are monitoring their agents, to see if their agents are, you know, outdated and in need of that sort of stuff. But the way that we like to talk to customers who ask us this question, I do think the most interesting instantiation of this is: there's a new model.
And now I need to go upgrade my agents, or maybe be done with those agents, because the new model enables me to build agents that are way more powerful and do more interesting things than the old agents did. Right. But I think that upgrade process, that migration process, is something people have had to wrap their heads around. It's like a breaking change, and I have to put actual energy into making that work.
And obviously, sorry to talk about evals, but if you have evals, this process is easier. But I do think that's one of the things we've tried to do: how do we give you skills, and how do we give you the right tools, to make that process easier? And then you could go be AGI-pilled and choose to actually automate more of that with more agents.
So a year from now, we're back at Code with Claude. Where do you think the platform will be? What will I be able to do, and how will it be different from what I can do?
¶ What the platform looks like a year from now, when Claude writes its own harness
You can go first.
A year's a long time, in this industry especially.
How close are we to "Claude, make me a billion dollars"? That's really what I'm asking.
Yeah. Yeah. Yes. We will be asking Vlad for this week.
I mean, yeah, we wanna get closer and closer to that state. Okay, so a couple of things. I think in a year from now, one thing that we'd love to get really, really close to is actually that kind of simplicity, and this might be a significantly higher order of abstraction. I don't know what the form factor will look like, but the parameters we will care for from users will be that outcome, and of course it has to be verifiable, there are some parameters that have to be restrictive, and the budget. And I think we'd want to experiment with directions where Claude actually gets so good at understanding itself.
It figures out what model you should be using, it figures out how to spin up all the sub-agents. I actually don't think you need to think so much about harness engineering in that world. Today, you know, you don't have to think so aggressively about tool construction, for example. We've kind of made that a little easier, and you get to delete a little bit of that scaffolding.
Prompt engineering too.
Yeah, exactly. Exactly. And I think if you just keep going up that stack, today a lot of the innovation is happening at this really high level, almost a harness architecture level, which is really fun. But I think a lot of that honestly also kinda goes away, where you almost don't have to think so much about model selection. You don't have to think so much about what kind of architectures are there, because we probably would have gone through enough iterations with Claude that Claude is actually able to understand itself enough that it can almost write itself on the fly to figure out what is necessary.
So, that two-parameter world of outcome and budget: I don't know that we'll get there in a year, but I feel like we might be able to do the outcome part of that with maybe, you know, some error bars on the budget.
Really cool.
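The "outcome and budget" idea can be pictured as a loop that keeps trying until a verifier accepts the result or the money runs out. The Python sketch below is only a toy under loud assumptions: `run_to_outcome` is a hypothetical name, each attempt is assumed to cost a fixed, known amount, and `attempt` and `verify` are placeholder functions standing in for an agent run and a verifiable outcome check. Nothing here reflects how the platform actually works.

```python
from typing import Any, Callable, Optional, Tuple


def run_to_outcome(
    attempt: Callable[[int], Any],   # hypothetical: performs try number n, returns a result
    verify: Callable[[Any], bool],   # hypothetical: the verifiable outcome check
    budget: float,                   # total amount the caller is willing to spend
    cost_per_attempt: float,         # assumption: every attempt has the same known cost
) -> Tuple[Optional[Any], float]:
    """Retry until the outcome verifies or the next attempt would exceed the budget."""
    spent = 0.0
    n = 0
    while spent + cost_per_attempt <= budget:
        result = attempt(n)
        spent += cost_per_attempt
        n += 1
        if verify(result):
            return result, spent
    # Budget exhausted without a verified outcome.
    return None, spent
```

The caller supplies only the two things the speakers name, a check for the outcome and a budget; everything about how each attempt is carried out stays hidden behind `attempt`.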
Yeah. Okay, that was really cool. I'm gonna give you a slightly more boring answer, which is: in that world, if Claude is on the fly, if your agents on the fly are becoming what they need to become in order for you to do what you're trying to do, the platform has to seriously scale.
And so I do think some of this will be, what are the right abstractions that actually enable that, right? Somewhere on the primitive to higher-order realm. But I do think so much of what our team is going to be doing is making sure that the tokens that people want to come in and out of Claude are going to be able to come in and out of Claude, because our system is scaled to meet not just the demand, but that world where you have agents that are literally constantly running and recreating themselves and doing this sort of work.
You just need a system that can handle long-running requests, that can handle a bunch of differently shaped things. And so for us, I never want the ability of the platform itself to scale to get in the way of what people would otherwise be able to accomplish with these things. I think that's something that's gonna probably be very front of mind when we're talking in a year.
Awesome. I'm excited. Thank you so much for joining. I really learned a lot.
Thanks for having us.
Folks, you absolutely positively have to smash that like button and subscribe to AI and I. Why? Because this show is the epitome of awesomeness. It's like finding a But instead of gold. Pure unadulterated knowledge bombs about chat. Every episode is a roller coaster of emotions, insights, and laughter that will leave you on the edge of your seat. It's not just a show.
with Dan Shipper at So do yourself a favor, hit like, smash subscribe, and strap in for the ride of your life. Without any further ado, let me just say Dan, I'm absolutely
