
Open source AI to tackle your backlog

Apr 17, 2025 · 42 min · Ep. 310

Episode description

Vibe coding, agentic workflows, and AI-assisted pull requests? In this episode, Daniel and Chris chat with Robert Brennan and Graham Neubig of All Hands AI about how AI is transforming software development—from senior engineer productivity to open source agents that address GitHub issues. They dive into trust, tooling, collaboration, and what it means to build software in the era of AI agents. Whether you're coding from your laptop or your phone on a morning walk, the future is hands-free (and All Hands).


Transcript

Jerod

Welcome to Practical AI, the podcast that makes artificial intelligence practical, productive, and accessible to all. If you like this show, you will love The Changelog. It's news on Mondays, deep technical interviews on Wednesdays, and on Fridays, an awesome talk show for your weekend enjoyment. Find us by searching for The Changelog wherever you get your podcasts. Thanks to our partners at fly.io. Launch your AI apps in five minutes or less. Learn how at fly.io.

Daniel

Welcome to another episode of the Practical AI Podcast. This is Daniel Whitenack. I'm CEO at Prediction Guard, and I'm joined as always by my co-host, Chris Benson, who is a principal AI research engineer at Lockheed Martin. How are you doing, Chris?

Chris

Doing well today, Daniel. How's it going?

Daniel

It's going great. I'm on the road this week, talking about AI and security in various places, which is always fun. And often, you know, things come up in those sorts of talks. One of the things I've gotten a lot of questions about this week, actually, is the impact of AI on coding workflows and vibe coding and all of those sorts of things. And I'm really happy to have with us some really amazing guests today to help us talk through some of those subjects and also share what they're doing, both on tooling and models. We've got Robert Brennan, who is co-founder and CEO at All Hands AI.

And then we have Graham Neubig, who is co-founder and chief scientist at All Hands AI and an associate professor at Carnegie Mellon. How's it going, guys?

Robert

It's going good. Very good. Thanks for having us.

Daniel

Yeah. Thanks for joining. Maybe one of you, or both of you, could kind of give your thoughts generally. Like I say, even this week at conferences, it seems like half of the questions that I'm getting are around how AI is impacting developer workflows. You know, how many people are really using vibe coding tools.

If you're using vibe coding tools, what impact is that having on code quality? All of these sorts of things. So I'm wondering, from the perspective of All Hands and the work that you all are doing, what does the environment around these code assistant, vibe coding tools look like from your perspective right now? What does that ecosystem look like? And then maybe setting All Hands in the context of that would be helpful.

Robert

For sure. Yeah. So there's a huge variety of tooling out there right now for code generation, so it's a very hard space to navigate. There are two ways I like to bifurcate the space.

One is, on the one hand, you have a lot of tools that are really meant for rapid prototyping. There's some really cool stuff happening there, stuff like Lovable, Bolt.new, v0.dev, things like that. They tend to be very visual. You're getting quick prototypes of games or websites, things like that. Some really fun stuff happening there.

And that stuff is enabling a whole new set of people to experiment with software development — people who are maybe designers or product managers, who don't really have coding experience or have very little coding experience. They can now build whole apps, which is super cool. And then on the other end of the spectrum, you have stuff that is much more oriented towards senior developers who are shipping production code. They're working on a code base that's going to go and serve millions of users, where you have to be a little bit more careful about what's going on. Also really cool stuff happening on that end of the spectrum. And then the other way I like to bifurcate this space is that you have some tools that are very tactical.

So, stuff like GitHub Copilot, where it's inside your IDE, suggesting code exactly where your cursor is inside the code base. You're zeroed in on a task, and the AI is just helping you move faster through that task. And then on the other end, you have these tools that are much more agentic. Right? They're able to just take a quick human description of a problem and then go off and work for five, ten, fifteen minutes while you go get a cup of coffee or work on a different problem or catch up on email, and then it comes back to you later with the solution.

And OpenHands sits basically on the right end of both of those spectrums. Right? We are really oriented towards senior engineers who are working on production-grade code bases, and we're really oriented towards these more agentic workflows where you're giving an agent something to work on, and it can iterate forward on its own without you having to babysit it, without you having to be squinting at your computer screen trying to figure out where you should be editing.

Daniel

Yeah, that's super helpful. I'm wondering — this might be an interesting question — but I know, Graham, we've run into each other in the past related to human language work in another context. I'm wondering, from your perspective as kind of chief scientist, but also a researcher, as you've dug into this All Hands project and work and product, what has been surprising in terms of challenges, and maybe what was surprisingly easier than you might have thought? Any thoughts there?

Graham

Yeah, it's a great question. Thinking back in hindsight, it's kind of sometimes hard to come up with surprising things because the things that were formerly surprising now seem kind of obvious. But one of the things that I actually wrote a blog post about before was right when the Open Hands project started out, we were kind of on this bandwagon of trying to create a big agentic framework that you could use with and define lots of different agents. You could have your debugging agent, you could have your software architect agent, you could have your browsing agent and all of these things like this. And we actually implemented a framework where you could have one agent delegate to another agent and then that agent would go off and do this task and things like this.

One somewhat surprising thing is how ineffective this paradigm ended up being, from two perspectives — and this is specifically for the case of software engineering; there might be other cases where this would be useful. The first is in terms of effectiveness: we found that having a single agent that just has all of the necessary context — it has the ability to write code, use a web browser to gather information, and execute code — ends up being able to do a pretty large swath of tasks without a lot of specific tooling and structuring around the problems.

And then the other thing is that building many, many different agents is a relatively large maintenance burden if they're not very easy to define. So we've basically gone all in on having a single agent that can do many, many different things. But in order to do that, it has to have the ability to pull in whatever information it needs. So we have a framework called micro-agents, where basically you can pull in a new prompt or a new tool or something like that for a particular task, but the underlying overall agent is a single agent that can do many different things.
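To make the micro-agent idea concrete, here is a minimal sketch of how task-specific prompts could be pulled into a single general agent's context. The file layout, trigger-keyword matching, and function names are illustrative assumptions, not the actual OpenHands implementation:

# Illustrative sketch of a "micro-agent" mechanism: one general-purpose agent,
# plus small task-specific prompt files that get pulled into context on demand.
from dataclasses import dataclass
from pathlib import Path

@dataclass
class MicroAgent:
    name: str
    triggers: list[str]   # keywords that activate this micro-agent
    prompt: str           # extra instructions appended to the main agent's context

def load_microagents(directory: str) -> list[MicroAgent]:
    """Load micro-agent prompt files; assume the first line holds comma-separated triggers."""
    agents = []
    for path in Path(directory).glob("*.md"):
        first_line, _, body = path.read_text().partition("\n")
        triggers = [t.strip().lower() for t in first_line.split(",") if t.strip()]
        agents.append(MicroAgent(path.stem, triggers, body.strip()))
    return agents

def extra_context_for(task: str, microagents: list[MicroAgent]) -> str:
    """Return any task-specific prompts whose trigger keywords appear in the user task."""
    task_lower = task.lower()
    matched = [m.prompt for m in microagents if any(t in task_lower for t in m.triggers)]
    return "\n\n".join(matched)

For example, a hypothetical "github" micro-agent could add instructions about pushing branches and opening pull requests whenever the task mentions GitHub, while the underlying agent stays the same.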

Chris

Quick follow-up on that. Just for listeners who aren't really familiar with agentic workflows and all of that, could you talk just for a moment about what that means? If you've been developing for a number of years in the more traditional workflows that we all kind of started out in, and now we're hitting this world of agentic possibilities — could you talk a little bit about what's different for the user, going from traditional development environments to what agentic development workflows are like?

Robert

Yeah, so, you know, I think the sort of step one of integrating AI into your development process was Copilot, right, where it's really just plugging into autocomplete. Right? We're all familiar with autocomplete. We've been using it for decades. It just got a thousand times better all of a sudden.

Instead of just completing a class name, now it's writing several lines of code. So that was a huge boost to my productivity when I adopted Copilot. I was like, yeah, this is amazing. And then I was still, for bigger chunks of code, going to ChatGPT, and I was like, hey, can you write a SQL statement that'll do x, y, and z? Or things like that.

And often, I found myself doing this workflow where I would ask ChatGPT or Claude to generate some code. I'd paste that into my IDE, run it, get an error message back, paste the error message back into Claude or ChatGPT, and I'd just do this loop. And at some point, I was like, well, this is dumb. I'm just shuffling text between one app and another. And that was actually when I built my first agent, basically, where I built a little CLI that would just do that loop with Anthropic in the background.

And that's kind of the core of what an agent is. It's doing a full loop where basically you give a problem to the agent and say, okay, I wanna write a SQL statement that does x, or I wanna modify my app to add this new feature. The agent writes some code. It runs the code.

It sees what happens. It gets some kind of output from the real world, whether that's the output of a command or the contents of a web page or the contents of a file, puts that back into the LLM's context, and then it can take one step forward, closer to its goal. And as you get better and better and more accurate at taking one step closer to your goal, you can take on longer and longer range tasks. So I would say in the beginning, agents were really good for things that would take like 10 steps — something really simple, like implement a new test and then make sure it passes. And now they can implement things that take hundreds of steps, which is really cool.

I mean, that's the change we've seen over the last six to twelve months: they're able to take on these huge tasks. So I can say, implement feature x — front end, back end, and add testing — and today's agents are able to just continue executing, stepping forward, until it comes back as a full PR where all the tests are passing and it's just kind of packaged up and ready to go.
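As a rough illustration of the loop Robert describes — give the agent a task, let the model propose a command, run it, feed the output back, and repeat — here is a minimal sketch. The call_llm and run_command functions and the action format are hypothetical placeholders, not the actual OpenHands internals:

# Minimal sketch of an agent loop: ask the LLM for an action, execute it,
# feed the observed output back into context, and repeat until the task is done.
import subprocess

def run_command(command: str) -> str:
    """Execute a shell command and return combined stdout/stderr for the model to read."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True, timeout=120)
    return result.stdout + result.stderr

def call_llm(messages: list[dict]) -> dict:
    """Placeholder for an LLM call that returns either a command to run or a final answer,
    e.g. {"type": "run", "command": "pytest -x"} or {"type": "finish", "summary": "..."}."""
    raise NotImplementedError

def agent_loop(task: str, max_steps: int = 50) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = call_llm(messages)
        if action["type"] == "finish":            # the model says the task is done
            return action["summary"]
        output = run_command(action["command"])   # observe what actually happened
        messages.append({"role": "assistant", "content": str(action)})
        messages.append({"role": "user", "content": f"Command output:\n{output}"})
    return "Stopped: step limit reached"

The step limit is the simplest guard against the agent looping forever; longer-range tasks, as discussed below, come from making each individual step more reliable.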

Daniel

Selfishly, maybe I'll pass on a question. I was sitting around with a number of people at the conference I'm at last night, and there were some opinions. This gets to some of what you were just talking about — some of what you talked about at the beginning about this being geared towards more senior developers working in existing code bases, but also what you were just saying about that kind of workflow. It was kind of the opinion around the group that I was with last night that, hey, a lot of these tools might be well suited to senior engineers, because you can iterate like that and actually have a sort of smell test for what's going right and what's going wrong, but not really for less experienced developers or new developers who don't have that ability. I'm curious to understand your perspective on that, and maybe who's using this and who's using it successfully, I guess, is the question.

And what does that persona look like?

Robert

Yeah. I think it's important to realize that you still need to keep all the same code quality controls in place that you did before the age of AI, if not more. Right? Everything needs to go through code review. You need somebody who's familiar with the code base to look at the changes that are happening.

I would say one of the failure patterns I see with the technology is that a lot of times a junior engineer, or somebody who doesn't really know how to code, vibe codes their way to a pretty good MVP, because these agents are especially good at greenfield stuff. Right? They can build a to-do list app all day. And then as you layer on more features over the course of weeks or months, the code base just starts to rot a bit. The agent adds a bunch of stuff — maybe it duplicates a whole function because it couldn't find the original function, or it just keeps expanding a single function so that it's thousands of lines of code and has all these forking paths.

If you don't have somebody looking at the changes that are being proposed and critiquing them and telling the agent, hey, you added this new function, but we have an existing function that does that, or this function's getting too big, please refactor it — if you're not looking over its shoulder and critiquing its work, the code base will just grow into this monster, and you'll have to throw it all away because it's beyond repair.

Daniel

Well, I do wanna get into some of the unique elements of All Hands and the perspectives that you all are taking. One of the things, of course, that strikes me right away — it's even at the top of the web page when you go there — is your approach of doing this open source. An open source approach to this kind of tool for developers. I'm wondering — both of you could speak to this, but maybe, Graham, you could start — obviously you've built various projects over time and done research and been plugged into the research community: why, from your perspective, is it important that at least some key portions of what you're building here are open source, and what do you think that might mean for these kinds of tools, including All Hands, moving forward?

Graham

Yeah, so there are a number of reasons why we decided to do this open source. The first reason is I think everybody in our community believes that this is going to be very transformative technology. And it may drastically change the way we do software development going forward. And we have two options. We have an option where software development is drastically changed for us by other people, or there's the option where we do it ourselves together.

And we believe in the latter approach, basically. We believe that if this is going to have a big effect on software development, software developers should be able to participate in that. That's kind of the ideological point of view. The other point of view is we also believe that, from a research perspective, open source — especially agent frameworks, not necessarily the underlying foundation models — is not ever really going to be behind the closed options. The reason why is because academia and all of these people really love this topic.

They really love working on it. If we have an open framework and we can provide a platform both from the point of view of having a good code base that's easy to experiment with and providing resources to people who want to do experimentation on these topics, then the open source community together will be just as good as any company that is working on this in a closed manner. And so instead of reinventing the wheel, we can all invent it together and come up with something really good that's good for developers, interesting for the academic community and other stuff like that.

Chris

Could you talk a little bit about how you bring developers into this process? Since that's kind of foundational to how you're operating, could you talk a little bit about what you're looking for, how you bring people into your community and kind of ramp them up on that?

Graham

Yeah. So it's kind of interesting. Our software is a little bit complex, because there's necessary complexity in order to do things like make a very strong agent, give it all the tools it needs, allow it to run in a safe manner, and things like this. One thing that we try to do, if people are interested, is point them in the direction of issues they could start working on. We have a unique problem, which is that a lot of the easy issues that would be good for developers to learn more about the code base are just solved by the agent.

We're still working through the best way to fix that. But especially front-end stuff — where we have a new front-end capability that we'd like to have — we've had a lot of people join successfully through that. Then we've had longer-term research projects where we collaborate with people in universities, and we've been pretty successful at doing some interesting things there, I think.

Daniel

Cool. Yeah. I'm wondering, Robert — obviously, sometimes this is hard to do on an audio podcast, but if you could just give a sense. I logged into All Hands not that long ago online, so I see some visuals. But if you could maybe paint the picture: there's the open source side of things, which I'm assuming means people could host All Hands themselves, which might be interesting for some, but you also have a hosted version of that.

Could you just talk us through those options — how people can access this, what they'll see, how they integrate or connect their code into All Hands to get started — that kind of getting-started picture?

Robert

Cool. Yeah. Yeah. So for the open source, everything runs inside of Docker. So that includes the application itself.

You just run docker run, and you'll see a web interface running at localhost:3000, and you can just drop in a prompt to the agent. You can also connect it to GitHub by generating a token inside of your GitHub settings, plugging that into the UI, and then you can start to pull and push to your repositories. It's a little bit tricky running things locally, because not only do we run the application in Docker, but when you start a new conversation with the agent, we want to make sure the agent's work is done in a nicely sandboxed way, so the agent gets its own Docker container to work inside of. So there's a little bit of trickiness — we have to deal with a lot of troubleshooting, the why-isn't-Docker-behaving-properly kind of stuff. So it's a little bit of a difficult application to run locally.

So we actually created app.all-hands.dev, where you can use OpenHands in the cloud. And it's pretty much one-for-one in terms of functionality with the open source. But there's a bunch of convenience features, because, a, we have this persistent server in the cloud, and b, we can take care of all the infrastructure for running these sandboxes for the agent. So, for instance, when you start up a conversation in the cloud, a sandbox comes up within one or two seconds, rather than having to wait thirty seconds or so for it to start up on your local machine. And we can also connect into GitHub a little bit more seamlessly, because we can have an OAuth application where it's just a one-click login and we can access everything.

And then the cloud feature that I love more than anything is that if you leave a comment in, say, a pull request — say the tests are failing — you can just say, @OpenHands, please fix the tests. And because we have this long-lived server in the cloud, that can just kick off a conversation automatically, and OpenHands will just commit back to your pull request. Those are actually the interactions I love the most, where I don't have to go into the OpenHands UI and fiddle around. Inside of GitHub, or soon inside of Slack, I just summon the agent, it does the work for me, and I get to reap the fruits at the end there.

Graham

My favorite is programming from my phone. You log into the app and just tell it what to do. I do that while I'm walking to work, and by the time I get to work, I have a full pull request to review. It opens up a lot of possibilities if you don't have to run it locally.

Daniel

Yeah. I could imagine also, just in the spur of the moment, thinking of some great feature to add — and a lot of those things get lost, right? So if you have the ability to just pop in — I know some people, whether they're running on a treadmill or coming out of the shower or something, can just pop in, give a prompt, have some work be done, and then finish getting ready and get into work. I love that idea.

Robert

Yeah. It's funny. I feel like I'm still getting a lot of coding done despite being the CEO of the company and being in meetings all the time. Because as I'm going into a meeting, I'll just quickly be like, hey, do x, y, z, go into the meeting, and then once I'm done, the code is just waiting for me.

Chris

It's funny, you guys are actually already leaping ahead and answering the question I was about to ask, because I was thinking about your first answer a moment ago, Robert. It's dramatically changing the workflow — and not only the workflow, but how and where you're coding and things like that. What I'm kind of curious about is this: if you've been developing for a long time, this feels a little bit magical. And as you've had users come into this new workflow, what are the mindset shifts that are either challenges or, conversely, most welcome — the ones that get people productive and useful and recognizing the utility of this and benefiting from it? Because there's a little bit of a leap from kind of where they grew up into this bold new world.

What's that mindset shift like, and how do you get people through that?

Robert

Yeah, it's a great question. It's actually very similar to when I started managing folks. For one, you just have to get good at thinking, oh no, I should delegate this. You have to kind of have that switch flip — your instinct is to fire up VS Code and just start working, and you have to have that moment of, oh, no.

Like, this is actually a good thing for the agent to work on or for my employee to work on. And there's also, like, a little bit of a trust thing. Right? Like, when I first started managing folks, I wanted to micromanage them. I wanted to, like, tell them exactly how to do everything, and it ended up being just more work for both of us and frustrating for them.

Once I learned to trust my employees and know that they might not do it exactly like I would do it, but they're gonna do a good job — they might need some coaching and some direction — but building that trust over time is really important, and it's the same thing with the agent. You know, the agent isn't always right. I like to say trust but verify. Right?

You need to read its code and understand what it's trying to do and where it might have misunderstood something, and maybe iterate a few times, either through a code review in GitHub or by just chatting with it inside the application itself. But yeah, it's very similar to that management experience of learning to take your hands off the keyboard and be really clear with somebody else about communicating: these are the requirements, and here's how you can improve, and things like that.

Daniel

Yeah. Graham, I have maybe a question that I also get a lot. Chris just asked one question that I get a lot, which is the workflow-related stuff. But the other question that I get a lot related to these types of tools is: hey, I've seen people create a lot of cool demos with these sorts of tools — small projects that you can sort of regenerate if they don't work. But if I'm working in a large existing code base, to the point that was brought up earlier, that's where most development happens.

What are the technical pieces that have to be in place for you to have an agent work in a larger code base or an existing project, and actually have the context that's needed to do things that fit — the context of other things that exist in the code base, but also potentially the context of, say, a company style or other things like that?

Graham

Yeah, it's a good question. For reference, the OpenHands agent is the largest committer to our code base — and our code base is rather large and complex. I just checked now, and it had 209 commits over the past three months, and the next closest contributor had 142. So it's doing pretty well.

But there's a bunch of technical pieces that need to go together to make that work. The underlying language model is really important. Fortunately, a lot of the core language model providers are focusing on this. We're also training language models ourselves. But the underlying language model needs to have a lot of abilities.

One kind of boring but extremely important one is the ability to edit files. So about six months ago, this was a major problem for most language models. They were not able to successfully generate a diff between what a portion of the file used to look like and what the new portion of the file would look like, or they would add an extra line or duplicate things or stuff like this. So this was a major problem. Claude is very good at this right now.

A lot of the other language models are kind of catching up to be good at doing this. Another thing that's especially a big problem for large code bases is identifying which files to modify. And this is somewhat less of a big problem than I originally thought it would be. I was imagining that this would be a really huge problem, but actually language models are pretty good. Even if you give them no tools to specifically search a code base or something like that, they use find and grep and all the other tools that a normal programmer might use, and are able to navigate their way around the code base.

But I do think that code search and other things like this can help. We have some preliminary results that demonstrate that it doesn't necessarily improve your resolution accuracy, but it definitely improves your speed of resolution. And so that's another thing. Being able to run tests and iterate on tests, being able to write appropriate tests that check whether a new piece of functionality you're adding is actually working as expected or not, and, on the language model side, being able to try lots of different possibilities. So, for example, one big failure case of a lot of language models is they try the same thing over and over and over again, get in loops, and never get out of that.

Models like Claude are good at not doing this, whereas a lot of other models fall into this failure mode and don't do as well. So the list goes on. I could talk about this for much longer, but those are some of the most important parts, I think.
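To make the file-editing point concrete: one common way to make edits reliable is a search-and-replace style tool, where the model supplies the exact old snippet and the new snippet rather than a free-form diff. The sketch below is an illustrative assumption about how such a tool could look, not the actual OpenHands editor:

# Sketch of a search-and-replace file edit: the model provides the exact old text
# and the new text, and the tool refuses ambiguous edits instead of guessing.
from pathlib import Path

def edit_file(path: str, old_snippet: str, new_snippet: str) -> str:
    """Replace exactly one occurrence of old_snippet with new_snippet in the file at path."""
    text = Path(path).read_text()
    count = text.count(old_snippet)
    if count == 0:
        return "Error: old snippet not found; re-read the file and try again."
    if count > 1:
        return f"Error: old snippet appears {count} times; include more surrounding context."
    Path(path).write_text(text.replace(old_snippet, new_snippet, 1))
    return "Edit applied."

Rejecting missing or ambiguous matches like this gives the model a clear error message to react to on its next step, which feeds straight back into the observe-and-retry loop described earlier.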

Daniel

Well, Graham, you're already getting into this, which is another thing that I wanted to ask about. Maybe you could comment from your angle on the technical and research side, and Robert, I'd be curious on the business and product side, about why you also got into building models. And for those that want to take a look, there are some really great models that All Hands has released — it's just all-hands on Hugging Face — where you can read a little bit more, and we can talk about the details here.

But yeah, maybe first, just like, why was that a step that you all felt was important and or kind of wanted to be part of your contribution to the space?

Graham

Yeah. So there's two reasons. The first reason is we are an open source company and we kind of philosophically believe in open source and openness. If you're relying on a closed API based model entirely, then you can never fully achieve that goal. Another thing is practically there are issues with customizability and cost for closed models.

And the best closed models are somewhat expensive. There's a non-trivial cost involved with using them to do agentic tasks, especially because you need to query them over and over and over again. And so having another option that's more cost-effective — one that we can either just use as is, or possibly switch over to for the easier portions of a task while using a more expensive model for the harder portions — would be useful. And then customizability: we have a lot of our enterprise customers or design partners asking for some variety of customizability, be it to their code base or to a programming language that they're interested in working with, and other things like this. And if we don't have a model that we can fine-tune, we are limited in the scope of things that we can customize.

So looking forward, that's something that we would like to do. And we're not done yet. We just released V0.1, so we'll definitely continue being interested in this in the future.

Daniel

Awesome. Yeah. I guess from the product perspective, Robert, in terms also of the hosted version that you're running — the one that people can log into — are the models that you've built integrated to one degree or another in that kind of live product, or what's the road map there?

Robert

So yeah. Right now, it's all Claude 3.7 under the hood. There are some really cool ways we can build our models into the process. One is if we can route certain parts of the agentic loop, or certain queries, to a cheaper model rather than putting everything through the most expensive model out there, without sacrificing accuracy. That's really great for our users, because we can pass those savings on to them.

So that's one really interesting path. Another path is that we have a model that is specifically trained, basically, to recognize whether OpenHands is on the right track to solving a problem or if it's going off the rails. Right? So we built this model specifically based on the dataset that we've gathered. And that's a really cool product feature, because on the one hand, you can just recognize: did we solve the task or did we not?

And then report back to the user appropriately. We can stop the agent if it's going off the rails, and we can say, hey, this is what's going wrong, please reroute using this new strategy. We can also launch several different trajectories towards solving a problem and then maybe pick one out of the three that we launched and say, okay, this one looks like it's going in the best direction.

Keep following this one and kill the other two. So there's lots of really cool stuff we can do there by having a model that specifically knows the inputs and outputs of what OpenHands is doing.
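A rough sketch of that trajectory-selection idea — launch several attempts, score each with a critic model trained to judge whether the agent is on track, and keep the most promising one. The score_trajectory critic and the function names here are hypothetical placeholders, not All Hands' actual implementation:

# Sketch of trajectory selection: run several agent attempts at a task, score each
# with a critic model, and keep only the best-scoring trajectory.
from typing import Callable, Optional

def score_trajectory(trajectory: list[str]) -> float:
    """Placeholder critic: return a score in [0, 1] for how on-track this attempt looks."""
    raise NotImplementedError

def pick_best_trajectory(run_attempt: Callable[[], list[str]],
                         num_attempts: int = 3,
                         min_score: float = 0.5) -> Optional[list[str]]:
    """Run several attempts, score each, and return the best one if it clears the threshold."""
    scored = []
    for _ in range(num_attempts):
        trajectory = run_attempt()               # one full agent run, returned as an event log
        scored.append((score_trajectory(trajectory), trajectory))
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best if best_score >= min_score else None   # None: every attempt went off the rails

The same critic score could also be checked mid-run to stop an attempt that is drifting off the rails, which is the other use Robert describes.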

Chris

Well, while you're talking about that, I'm wondering — we've kind of talked about the frameworks and such being open. Could you talk a little bit about whether the models you're creating will be open, or do those stay as part of a proprietary offering? How are you envisioning that in terms of what the models are, what they're addressing, whether they're larger or smaller models, what licenses apply, that kind of thing? Could you speak a little bit about your philosophy and strategy toward that?

Robert

Yeah. I mean, so far, we're opening everything up. Right? We've taken the position that we basically want OpenHands to be as useful as possible to an individual developer running it on their workstation. Right?

You know, we are a company. We do wanna make money. And so we are building some closed source features specifically for large teams who are using OpenHands together. But so far, we've taken the position that basically all the research we do and all the know-how for how the agents do as good a job as possible at solving software tasks — that should be open source. That should be available to every developer.

And it's stuff like collaboration features, things like multi-tenancy, auditing, compliance — stuff that big enterprises need and your average developer working on an open source project doesn't. That's what we're gonna hold back and say, okay, this is closed source, and we're going to enable big enterprises to do this stuff the way that big enterprises like to do things.

Chris

And one other follow-up, just because I happen to work in an industry where security and privacy are really paramount. Going off to one of the large foundation models via the cloud often runs into challenges for enterprises that have security concerns. Any thoughts on, or something that you can offer for, when everything needs to be held closely — when closely held data cannot go out over a cloud connection, that kind of thing? What are you thinking about that, either for the present or for the future?

Robert

Yeah. So we basically have three offerings. We've got the open source, which anybody can run and use for free. A lot of security-conscious companies do start with the open source, because they can hook it up to Bedrock or a local model — basically, they can plug into the existing models that the company has approved. We have the cloud offering, which all runs through Anthropic and all runs through our servers, which is a great convenience for a lot of people, but kind of scares off some companies that are very security conscious.

But then we can also take basically all the infrastructure we've built for our cloud offering and ship it into somebody else's cloud. So you can run it all inside your AWS environment. You can connect it to Bedrock. So it's basically all configured to stay within your walls.

Daniel

I'm wondering, just thinking about current functionality — Graham, you mentioned all of these commits from All Hands in your own repo, and that some of those easy, maybe good first issues that developers could solve are already taken care of. How do you see the level of performance now? How are you all measuring that, red teaming that, testing that over time, and thinking about improving it over time? How do you even consider something like that, given that there are so many different types of projects out there? Obviously, there are academic benchmarks.

I think you have SWE-bench and those sorts of things. But as a product, as an offering, how do you think about and measure that performance over time? What right now is performing very well, and maybe where are those areas of improvement?

Graham

Yeah, it's a great question. There's a lot to that question, but just about how we are doing benchmarking: up until recently, we were doing a lot on SWE-bench, but we have a very large evaluation harness that actually already has 20 benchmarks incorporated into it by our academic partners. And one thing that we're thinking about doing going forward — and are actually kind of in the process of doing — is we have identified the common use cases, the ways that people typically use OpenHands, and tried to identify benchmarks that reflect these use cases, and then do a more balanced benchmarking strategy across these. So we have some pretty exciting results about things like web navigation and web information gathering, which is really, really important if you want to function in an environment where you have lots of docs, or learn about a new library, or do data processing and data science related tasks.

And then we're also doing things like making sure that you can fix broken commits. So you have a pull request that has failing tests and merge conflicts, and can you merge that in? And this is something developers hate to do but need to do all the time. So this is something we're putting a lot of effort into making sure we're good at, and we have some good results about that that we hope to release soon. And then other things like test generation, version updates, things like this.

The academic world is large, so it turns out there are benchmarks for almost all of these that have already been created by some institutions somewhere in the world. And so very often we talk to these institutions and say, Hey, do you want to contribute this into our evaluation harness? Often the answer is yes because they did their work for a reason. They want it to be used. So we're using that as a way to expand our vision of benchmarking to cover the actual use cases that the users are most interested in.

Chris

Well, as we start to wrap up here, one of the things that we really like to do is get a sense of the future going forward. And with both of you here, I'd like to ask the same question of each of you and get each of your takes, for a little bit of diversity on how you're seeing things. You've kind of introduced us to this new way of thinking about development going forward, and what's possible. For old guys like me, it's definitely changing how I think about development. And this is moving really, really fast right now.

And you know, it's accelerating. I'd love to understand how each of you sees the future, both in the space itself — in terms of changing the world of developer workflows — and in your role in that process, as an organization and as an open source community. I'll let you guys decide who wants to go first, but I'd love to hear each of your perspectives.

Robert

Yeah, I think the thing that's really exciting for me is the idea of bringing the next, you know, billion developers into the fold. When I first started learning to code, I felt like a wizard. I could just all of a sudden make my computer do anything, and I could build all sorts of different applications. And I was, you know, a baby engineer. I was building all sorts of nonsense.

But I just felt so powerful and so excited. And then that fades over time and, you know, it becomes a job. And then I would say over the last year or two, I've got that excitement again. I feel like a wizard again. I can get so much done using large language models and using agents.

And so I'm really excited to bring that feeling to a whole new tranche of people who have maybe had ideas for software that they want — for workflows that they want, for applications that they'd like to have — and just haven't been able to bring them to life. And I think it's really exciting that they'll be able to do that. I think there are a lot of questions as to how we enable them. Like, my mom definitely has some really cool ideas, but she has no business monitoring a production database. And so I think we're gonna need to rethink how infrastructure works and how we ship applications and things like that.

I think there's a lot of thought that is gonna need to go into that, and I'm really excited to see what shakes out.

Graham

Yeah. I love Robert's answer. Coming from a completely different angle — one of the things I have in my introductory slides for a presentation I give about coding agents is looking at the Nobel Prize winners from last year in physics and chemistry. The Nobel Prize winners in physics were people like Geoffrey Hinton, and the ones in chemistry were people like Demis Hassabis. And these are obviously the top awards in areas other than computing.

And I'm building agents to create software. But the reason why I'm building agents to create software is not because software is the end; it's because software is a means to an end. And I think AI has a huge possibility to improve the human condition and things like this. But I think the way it's going to do that is through software, basically.

And so if we can make it very easy to effectively create software and make it very accessible to the people who want to use it, we'll be able to make great strides forward. So that's what I'm most excited about.

Daniel

Awesome. Well, we're definitely excited to see what you all are doing. It's amazing work, and I'm really encouraged to hear your perspective on the project and the way in which you're building. I encourage all of our listeners to go and check out all-hands.dev. Check it out. Try it out. And, yeah, thank you both for joining and taking the time. It's been great.

Graham

Thanks for having us. Thanks so much.

Jerod

All right. That is our show for this week. If you haven't checked out our Changelog newsletter, head to changelog.com/news. There you'll find 29 reasons. Yes. 29 reasons why you should subscribe. I'll tell you reason number 17. You might actually start looking forward to Mondays.

Graham

Sounds like somebody's got a case of the Mondays.

Jerod

28 more reasons are waiting for you at changelog.com/news. Thanks again to our partners at fly.io, to Breakmaster Cylinder for the beats, and to you for listening. That is all for now, but we'll talk to you again next time.
