
Using AI to Build Better Robots

Feb 01, 2024 · 35 min · Season 1 · Ep. 83

Episode description

Peter Chen is the co-founder and CEO of Covariant. Peter’s problem is this: How do you take the AI breakthroughs of the past decade or so, and make them work in robots? Peter was one of the first employees at OpenAI, the maker of ChatGPT. On the show, he talks about how AI has evolved, and why it's so difficult to teach a robot to fold a towel.

See omnystudio.com/listener for privacy information.

Transcript

Speaker 1

Pushkin. For a long time now, we've had a lot of technological innovation in virtual things, in bits: the Internet, digital images, large language models, etc. We have had noticeably less innovation in actual things, things made of atoms, things that would hurt if you dropped them on your foot. Now that seems to be changing. People are using innovations in bits, improvements in computing and communications and AI, to drive innovation in actual things, everything from batteries to garbage cans to airplanes. Next up: robots. I'm Jacob Goldstein, and this is What's Your Problem, the show where I talk to people who are trying to make technological progress. My guest today is Peter Chen. He's the co-founder and CEO of Covariant. Peter's work at Covariant was partly inspired by the work of Fei-Fei Li, who coincidentally is the AI researcher I interviewed just last week on the show. Peter's problem is this: how do you take the AI breakthroughs of the past decade or so and make them work in robots?

Speaker 2

So to really tell the story of robotics, we have to tell the story of robotics even without AI. For a very long time, robotics was a field that you would actually find in the mechanical engineering departments of universities. It's largely a hardware problem, a control problem: how can you design the motor well? How can you design the gearbox well?

Speaker 1

Yeah, right?

Speaker 2

Can you design the control algorithm so that you can get the robot to an exact XYZ location in the 3D physical world without oscillating around?

Speaker 1

Making the thing move. How do you build the parts that make the thing move the way we...

Speaker 2

...want it to move, exactly. It's all about telling this piece of machinery that we call a robot to do exactly what we tell it to do, which turned out to be a fairly difficult engineering problem, and that's why people have worked on it for many decades. But it has gotten really good.

Speaker 1

And so this is like the classic kind of image you see from a car assembly line: a robot arm, you know, welding a part onto the body of a car, again and again and again, all day long.

Speaker 2

Yeah, exactly right.

Speaker 1

So robots are clearly good at welding the same part onto the same car a million times. What are the limits of that approach? What were the problems people were bumping up against?

Speaker 2

Yeah, so the problem is that in order to use that kind of robotics, it puts really big limitations on your environment, right? You basically need to be able to reduce your task to be solvable by repeated motion. And so if you look at how these kinds of assembly lines that use classical robots work, they always feed the material into exactly the same place. No matter how the parts came in from the suppliers and whatnot, you always need to load them up in exactly the same way, because there's really no adaptivity at all in these robots. They're just executing the same thing again and again.

Speaker 1

It has to be... they're very precise, but their whole environment has to be super homogeneous, the same every time.

Speaker 2

Exactly. So that's problem one, and it's very difficult; not everything can be reduced that way. The second problem is that even in the case where you can reduce the problem to that kind of pure mechanical repeated motion, it's still very expensive, because you need to program the robot to do that one specific task, and if you change your task slightly, you need to reprogram everything, typically from scratch. That means robots are not just extremely rigid, which limits the range of capabilities that they can reach and do; they're also very expensive even for that very fixed, rigid capability.

Speaker 1

Right. And so you need something that's the same every time, and you need to be doing a lot of it, because otherwise the economy of scale just doesn't work out. It's too expensive to try and get the robot to do something else.

Speaker 2

Exactly.

Speaker 1

So I know that as a student, as an undergrad and a grad student, if I have it right, you worked in the lab of this professor at Berkeley who for a long time had been trying to teach robots to fold towels. Yes. Which is an amazing problem, because it's one of those ones that seems so simple, right? It seems way easier than riveting parts onto a car or whatever, but it turned out to be in fact much harder for robots. And I feel like that's telling. Why was that so hard for robots, and what do we learn from it?

Speaker 2

When you do welding on a car body, as we have discussed, you can reduce the problem to just simple mechanical repeated motion. But a piece of fabric is flexible, it is deformable, it can come in many, many different kinds of shapes. It has many different possibilities, right?

Speaker 1

It's much more complex than a car body. Weirdly, it's not intuitive, but when you think about it, it's like, oh, it could have a vast number of possibilities.

Speaker 2

It has a lot more possibilities in how it can present itself to you. And exactly because of that, recall the first limitation of traditional robotics, which is that it can only work on problems that can be reduced to repeated motion. Towel folding and folding apparel items is exactly one of those things that cannot be reduced, because...

Speaker 1

It's just a little bit different every time. The towel is a little bit of a different shape. It might be sitting on the table and it'll be folded over in some...

Speaker 2

...weird way, exactly. If it's folded onto itself, how much is it folded onto itself? How many wrinkles does it have? All of those make a big difference in terms of what the robot should do with it. So it's a really good example of something that traditional robotics cannot solve, and you really need AI to solve it.

Speaker 1

And when you and your co-founders started working on the problem, it was sort of before the kind of modern era of AI that we're living in now, right? And I read that there was this big moment in kind of the origin of the company, when this database, essentially, of labeled images called ImageNet was released. Which was interesting to me, because I just talked to Fei-Fei Li for this show.

Speaker 2

So Fei-Fei is actually one of our advisors and investors.

Speaker 1

Just a coincidence. For the record, she didn't put me in touch with you. But tell me about why ImageNet was meaningful in the birth of your company.

Speaker 2

Yeah, there was actually a bigger lesson here than just robotics. The bigger lesson is that artificial intelligence is actually becoming simpler and simpler. When you looked at the field of artificial intelligence fifteen, twenty years ago, it used to be many different subfields. The people that worked on robotics had a completely different tool set than the people that worked on computer vision, and the people that worked on computer vision had a different tool set from the people that worked on natural language processing.

Speaker 1

Well, and they feel different, right? Teaching computers to understand an image, you know, to sort of deal with an image of the world and understand what it means, feels quite different than teaching a computer to have a conversation. It's not obvious that you would use the same tools to do those.

Speaker 2

And that definitely was the consensus. Like, why...

Speaker 1

Would it be the same?

Speaker 2

Yeah, why would it be the same? It feels very different. It feels like you need different kinds of data. And I would just say it: essentially, the field of AI is becoming more and more unified. The methodology, the model that you would use, is actually becoming more and more similar, and sometimes it's even the same, across these very different fields of robotics, computer vision, language.

Speaker 1

So basically, you build a neural network and then you just train it on a bunch of images, or train it on a bunch of documents, and what you train it on is what determines sort of what it's good for.

Speaker 2

It's like that, basically. Instead of each subfield coming up with something different... think about it as hand-programmed intelligence, right? Let's try to break down what a sentence means, break it into different parts. And when you approach a computer vision problem, let's try to come up with different features: some features represent an edge, some features represent the background. Instead of trying to manually program that kind of human intelligence into AI, you're basically taking a step back and saying: I'm just going to create a very flexible learning mechanism, which is an artificial neural net, and then we're just going to feed it a lot of data. And if you have different types of problems that you're solving, you're just feeding this neural net different types of data, but it's still really the same kind of mechanism. That is a drastic departure from how artificial intelligence used to be done.
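To make that concrete, here is a minimal sketch (assuming PyTorch; the names, shapes, and data loaders are all illustrative, not anything Covariant-specific) of the idea Chen describes: one generic architecture, one generic training loop, and only the dataset changes between a vision task and a language task.

```python
import torch
import torch.nn as nn

def make_learner(input_dim: int, num_classes: int) -> nn.Module:
    # The same flexible learning mechanism, regardless of task.
    return nn.Sequential(
        nn.Linear(input_dim, 256),
        nn.ReLU(),
        nn.Linear(256, num_classes),
    )

def train(model: nn.Module, loader, epochs: int = 1) -> None:
    # One generic training loop; nothing in it is task-specific.
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

# Hypothetical loaders: what the model becomes good at is determined
# by the data you feed it, not by the learning mechanism itself.
# vision_model = make_learner(32 * 32 * 3, num_classes=10)
# train(vision_model, image_loader)  # fed images -> a vision model
# text_model = make_learner(512, num_classes=2)
# train(text_model, text_loader)     # fed text features -> a language model
```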

Speaker 1

And so in that world, the differentiation is just in the data set that you are feeding the model.

Speaker 2

Totally. And in fact, I mean, this is jumping way forward in time, we were still talking about ImageNet. But if you look at really the most popular technologies today, these large language models: when you use different companies' large language models, it's really what you're saying. Like if I'm using ChatGPT backed by GPT-4, versus using Google's Bard backed by Gemini, versus Anthropic's Claude, or Cohere's Command model.

Speaker 1

You're just naming all the different big large language models now.

Speaker 2

Yeah. And when you think about these different big language models, you're really just referring to the different data sets that sit behind...

Speaker 1

...that they were trained on. So this is interesting, kind of abstract big-picture talk. I want to map this now onto the story of Covariant, the story of your company. So we're going back in time now. Tell me about the moment when you decided to start the company. What's going on?

Speaker 2

Yeah. So the moment that we decided to start the company, you referred to this ImageNet moment, this large data set that for the first time really taught people that you can train a neural network to solve one specific task really well.

Speaker 1

So this is twenty twelve. That's when a neural net trains on ImageNet and does a really good job, way better than anybody has ever done, than any model has ever done, at identifying objects...

Speaker 2

...in images, exactly right. That was really significant, because it means if you can collect a lot of data for a single task, and if you can get a group of PhDs to work on a model for that single task, you basically have AI for that task. You can solve that task really well. I mean, you might need to iterate on your data, you might need to iterate on your algorithm, but ultimately you can solve that task really well.

Speaker 1

You're basically saying, if you can gather the data, if you can gather a shitload of data of whatever kind, then you can get AI...

Speaker 2

...for that kind, right. And that is why you saw artificial intelligence really start working after twenty twelve. Google, Facebook, all of these companies have a lot of AI-based applications, but they are largely not democratized, because in order to get any single AI working, you still need a lot of data and you still need a team of PhDs to work on it. So it was a huge breakthrough in AI, but it was not sufficient for really widespread usage, just because the barrier to create one AI is so high. Okay, and then comes the second inflection point in the history of AI, which is really the start of foundation models. I'm talking about the most initial version of GPT, right? These large language models that are trained on multiple tasks so that they're incredibly generalizable. You can ask them to do something new and they can do it really well. And they also perform better at a single task than a specialized model.

Speaker 1

And so just to be clear: until a few years ago, people thought, reasonably, that if you wanted to build AI to, whatever, translate language, you would work really hard on that. You would try and build an AI specifically designed to be really good at translating text from one language to another. But this really surprising result, this really surprising thing that emerged from the work people were doing, was that in fact that's not the best way to get AI to translate language. It's: just throw everything you can, all the words on the internet, at an AI model and say, figure out everything about language, figure out how to answer questions about history, figure out how to translate, and figure out how to give me a recipe for, you know, pasta. And it turns out that technique gets you better results at each specific thing than trying to build a specialized model.

Speaker 2

Exactly. And that is really the magic of foundation models. And that's the thing that was not obvious to people outside of OpenAI for a very long time. And because we came from OpenAI, a lot of the founding team at Covariant came from OpenAI, we saw that insight earlier, and that insight allowed us to start Covariant to build foundation models for robotics way before other people even believed in the...

Speaker 1

...approach. Even people in the field? Even people in the field. So, yeah, when did you go to OpenAI? You went to work at OpenAI.

Speaker 2

We went to OpenAI when it was about ten-ish people, sometime in twenty sixteen.

Speaker 1

Okay. And when do you sort of personally have this realization? You're not the only one to have it, but when do you see the power of foundation models?

Speaker 2

There are two things to it. The first thing is that early on at OpenAI, we believed in the idea of scaling, really scaling up the model and scaling up the data sets. And you actually see models getting increasingly smarter as you scale them up. So one is that. And then the other one is, I would say, we had conviction in foundation models for robotics probably earlier than foundation models for language. And the one key thing is this: if you think about building a large language model that tries to compress the whole internet of knowledge, you still need to compress many things that are not quite related to each other. Right? Maybe you're browsing Wikipedia and you have to recite the composition of the soil on the moon, and you also need to learn how to play chess. There's really nothing in common between these two things, these two parts of knowledge, but you are asking one AI model to learn all of them. And the thing that made a lot of sense to us is that there's only one physical world. Even when you have many different robots that need to do different things in different factories, different warehouses, they are still interacting with the same physical world. And so building a foundation model for robotics has this amazing property of grounding: no matter what kind of task you're asking this foundation model to learn, it's learning with the same set of physics, the same surroundings.

Speaker 1

The grounding is the literal ground for these models.

Speaker 2

The literal ground, exactly. And...

Speaker 1

...the models have to understand just how the physical world works: that if you drop a thing, it will fall, and so on.

Speaker 2

And if it's something that is deformable, you push it, it will move a little bit. If something is rigid, it will slide. If something is rollable, it will roll away. These are the type of things that, no matter where you are on Earth and what type of robot body you're using, are the same, right? And so if you can have one single foundation model that can learn from all of these different data, it would be incredibly powerful.

Speaker 1

So just to state it clearly: you're at OpenAI, you're seeing the power of foundation models, and you decide to leave and start the company that is now Covariant. What are you setting out to do when you start the company?

Speaker 2

Yeah. So when we started Covariant, we had this really strong conviction that there should be a future that has a lot of autonomous robots doing all the things that are repetitive, injury-prone, dangerous, and that this could really revolutionize the physical world, make it a lot more abundant. And to really enable that future of autonomous robots, you need really smart AI. And because of the insight that we just talked about, we believed that AI had to be a foundation model. We believed you should have a single model that learns from all these different robots together and becomes smarter together.

Speaker 1

So I mean, the basic idea, the dream, is to build one AI foundation model for robots. Basically, yeah. In the same way that you can ask ChatGPT anything in language and it can answer you in language about any different thing, you have a model where you just sort of make it the brain of any robot, and that robot can see the world and move and pick things up and behave in the world. Exactly.

Speaker 2

And there's one key problem. The key problem is that unlike foundation models for language, where you can scrape the whole Internet of text as your pre-training data, there's nothing equivalent in the case of robotics. I mean, there are some images online, there are some YouTube videos online, but by and large, they don't really give you the same type of data in the form of robots interacting with the world. And the big problem is that there are just not that many robots doing interesting things in the world. And so a big chunk of what we set out to build as a company is recognizing that we need to build a foundation model for robotics; and in order to build a foundation model for robotics, you need to have large data sets; and in order to create large data sets, you have to have robots that are actually creating value for customers in production at scale, because if you're only collecting data in your own lab, there's only so much data. And so I would say the last six years of Covariant have largely focused on building autonomous robot systems that work really well for customers, doing interesting things at a level of autonomy and reliability that has not been hit before.

Speaker 1

In other words, Peter and his colleagues at Covariant built robot arms that businesses are paying to use out in the world. But to some extent, those robot arms are just a means to an end, because they're not just doing warehouse work. They're collecting more and more data that Peter and his colleagues are feeding back into their AI model to try and make it better and better. In a minute: what those robot arms are actually doing out in the world, and what they're learning.

So let's talk about what the robots you have built are doing out in the world. Besides generating the data that you will use to train the next generation of robots, what are they actually doing in the world right now, today?

Speaker 2

Yeah. So when we started the company, we surveyed the landscape pretty carefully, and we selected warehousing and logistics as the primary sector that we focus on today. When you're shopping online, when you click a button and something shows up the next day, there's a tremendous amount of complexity behind that, the back-end logistics of getting things to you. And typically it's estimated that each item is touched fifteen to twenty times between when you click a button to buy and when it shows up.

Speaker 1

And so that fifteen to twenty touches of getting a thing from the warehouse to your door is your opportunity.

Speaker 2

Exactly right. And combined with that, people don't want to drive in the middle of the night two hours into a suburb to work in a warehouse. It's just the kind of job that has an extremely high turnover rate, not the kind of job that people stay in for a very long time. And so there's a tremendous amount of desire for more robotics and more automation in those environments: picking up objects, sorting them into the right compartments of boxes, and then packing everything nicely and shipping it out to you as a customer.

Speaker 1

Tell me a little bit more specifically. What's one thing one of your robots is doing today in a warehouse somewhere on the earth?

Speaker 2

Yeah, so we'll get pretty detailed and geeky here; I'll actually tell you a little bit of the nitty-gritty details of the warehouse. So let me describe what the robot is doing. There's a tote full of items that comes up to the robot, and then it needs to grab one thing at a time...

Speaker 1

Is like a tote bag with a bunch of different stuff.

Speaker 2

A bunch of different stuff in it. And because these items are all laid out in a chaotic way and they're overlapping with each other, if you're not careful, you might drag out multiple items at the same time. And these items all have different shapes; they might be transparent, they might be reflective, they might be hard to see.

Speaker 1

This is hard. To go back to our earlier examples: riveting parts onto a car is easy, folding a towel is hard. This is hard because it's heterogeneous. Things look different; they come differently every time. This is hard for robots, hard for a sort of classical robot to do.

Speaker 2

Impossible for a classical robot. Impossible.

Speaker 1

Not just hard, impossible. Can your robots do it?

Speaker 2

Yeah, they can do it, extremely well.

Speaker 1

How did you solve it?

Speaker 2

How does it work, at the end of the day? The way that it operates is very similar to how the human vision system works. We have two eyes, and with two eyes looking at something, we can figure out the depth of a certain item, because our two eyes can triangulate a single point in the 3D world. And it's the same kind of mechanism here: you can just use multiple regular cameras, just like the one you have on your iPhone, and by having multiple of those, you give the neural net the ability to triangulate what's happening.
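The geometry behind that triangulation is textbook stereo vision: depth equals focal length times camera baseline divided by disparity, the horizontal shift of a point between the two images. Here is a minimal sketch of just that formula, with made-up camera parameters and no claim about Covariant's actual pipeline:

```python
def depth_from_disparity(
    x_left_px: float,        # horizontal pixel position in the left image
    x_right_px: float,       # horizontal pixel position in the right image
    focal_length_px: float,  # focal length expressed in pixels
    baseline_m: float,       # distance between the two cameras, in meters
) -> float:
    # Depth of a point seen by two parallel, rectified cameras:
    #   depth = focal_length * baseline / disparity
    # Smaller disparity -> farther away, just like human eyes.
    disparity = x_left_px - x_right_px
    if disparity <= 0:
        raise ValueError("Point must appear farther right in the left image.")
    return focal_length_px * baseline_m / disparity

# Example: a 700 px focal length, cameras 10 cm apart, and a 35 px
# disparity put the item 2.0 meters from the cameras.
print(depth_from_disparity(420.0, 385.0, focal_length_px=700.0, baseline_m=0.10))
```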

Speaker 1

Just the way our two eyes allow us to see depth, essentially. Exactly right. And are there other things, like weight? I mean, the arm is going to be picking things up. Like weight, or whether, you know, presumably it could be a shirt in a plastic bag, could be a box; some things are rigid, some things are deformable.

Speaker 2

Yeah. So what we have found is that if you just have a visual understanding of the world that is as robust as a human's, you go a really long way. Right? Like when I pick up a cup, I'm not doing a lot of calculations on how my fingers are placed, exactly how force gets translated to the cup, making sure it holds, right?

Speaker 1

It's part of the miracle of being a person, though, right? It's like a really hard problem to pick something up.

Speaker 2

It's a very hard problem. But then your brain subconsciously solves it for you. Your system-one thinking somewhat solves that...

Speaker 1

Fast thinking. You don't have to think about it, yeah.

Speaker 2

Mm-hm, exactly. And you can imagine, when you do this, in fact, even if my fingers are numb, I can still do it perfectly, just because the brain acquires this intuitive understanding of interaction with the physical world so well that you can do it.

Speaker 1

So basically vision gets you most of the way there.

Speaker 2

I would say vision, and then the ability to intuit physics from the visual input that you get.

Speaker 1

Tell me about that second part. Which is wild, though. I mean, "intuit", in this context, means sort of make inferences? I mean, intuit is... yeah, okay, yeah.

Speaker 2

But by "intuit", I mean it's not doing some kind of detailed physics calculation.

Speaker 1

It's not doing math.

Speaker 2

Yeah, it's doing a kind of high-level pattern matching: based on how these things look, this is likely going to be a successful way to approach the item and interact with it.
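A minimal sketch of that kind of pattern matching, assuming PyTorch; the architecture and all names here are hypothetical illustrations, not Covariant's model. A small convolutional net scores candidate grasp points from image patches, and the robot would take the highest-scoring one:

```python
import torch
import torch.nn as nn

class GraspScorer(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 1), nn.Sigmoid(),  # estimated P(grasp succeeds)
        )

    def forward(self, crops: torch.Tensor) -> torch.Tensor:
        # crops: (num_candidates, 3, H, W) image patches around grasp points
        return self.net(crops).squeeze(-1)

# Score a batch of candidate grasps and pick the most promising one.
scorer = GraspScorer()
candidates = torch.rand(8, 3, 64, 64)      # 8 hypothetical grasp patches
best = scorer(candidates).argmax().item()  # index of the best-looking grasp
```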

Speaker 1

What's next?

Speaker 2

So what is next immediately is very exciting, right? We are now getting to a place, and by we, I mean we as in the AI community, where we have enough computation power and algorithmic and modeling understanding to allow us to extract a lot out of data.

Speaker 1

Right. So from any given amount of data, you can get...

Speaker 2

...more. You can get more out of it, right.

Speaker 1

It's exciting for you because data is such a constraint on what you're trying to do.

Speaker 2

Exactly. And as we build up this large robotics data set by tapping into a lot of these advances, it gives us the ability to get even more out of the data sets that we're building, and allows us to build smarter and better robotics foundation models that perform better at the current tasks they're supposed to do and also power more robots.

Speaker 1

When people talk about concerns around AI, they often kind of half-jokingly use the phrase killer robots, which is usually like a metaphor or something. But in your instance, because you are building robots, and because you are building, by design, a model that is supposed to be used for lots of different purposes, I can in fact very easily imagine killer-robot applications of your work. That seems like a very plausible thing someone could do with it. Is that something you think about, worry about?

Speaker 2

I would say, very fortunately, in the very near to medium term use cases, we are very safe, because all of these industrial robots are very much confined to the stations that they're designed into. Industrial robots are heavy machinery that are subject to regulations; there are careful design guidelines and compliance requirements for them. They are already safe by design.

Speaker 1

You're saying your model is built for robots, basically for robot arms. I mean, is that essentially what the model you're building is? Really a foundation model for robot arms that are built to just be in one place, pick things up, put them down, that sort of thing? It's not a foundation model that you could map onto a car or something, or even onto a robot that walks around. It wouldn't work for that.

Speaker 2

That would not be the near-term use case. Because the near-term use cases are more in this safe-by-construction setting, it allows us to not worry about that problem; in fact, there's basically no way to misuse the technology in the settings where we deploy it. But I do agree with you: as we unleash this model onto a broader set of use cases, when these robots can actually interact with the world in a lot more freeform way, in those other cases the safety considerations become a lot more important, and there's definitely a lot more work that needs to be done for that to be reliable.

Speaker 1

We'll be back in a minute with the lightning round. There is a lightning round that we're gonna do now, for the end of the interview. What household chore do you wish that a robot could do?

Speaker 2

Cleaning up the kitchen. I like cooking, but I don't like the cleanup of it.

Speaker 1

That seems like a really hard one, like putting stuff away, basically. Wiping the counter, maybe less hard, but putting stuff away seems like a really hard job for a robot.

Speaker 2

It is. And these are the type of jobs that start to get to hardware limitations. These are the type of jobs where you probably do want a humanoid robot, something that moves and conforms to the human standard of interacting with the world.

Speaker 1

Because the kitchen is optimized for people, right? Or you would have to redesign the kitchen for a robot, but then that would suck, because then you couldn't get your plates; they'd be in some random spot or whatever. Okay. So you left OpenAI in twenty seventeen. In the past year, OpenAI became like this household word; GPT became a household word. Were you surprised, as a, you know, old-school former OpenAI guy, were you surprised by how wild the world went for GPT, or by how good it was, how soon?

Speaker 2

I was definitely surprised by the speed of it. I was surprised by the speed of both the technology's development and the speed of adoption. But I was not surprised by the fact that it could be this big, and it could be bigger.

Speaker 1

You know, when you were talking about warehouses and getting data from, you know, picking and packing, basically, I thought, of course, as anyone would, of Amazon. And I've read that they're working on some kind of robot arm. I feel like they would just have so much data that they could gather if they wanted, just because they're so big, they have so many warehouses. The same way that, say, Google just gets tons of data every day with every Google search. I feel like that would be very hard to compete with.

Speaker 2

We also don't need to compete with them. There is also a very large market that Amazon is not serving, and there are a lot of customers that don't have the same degree of engineering teams and data access as Amazon.

Speaker 1

You could be the Shopify of warehouse robots.

Speaker 2

There are all of these people that still need help, and we very gladly help them.

Speaker 1

What was the first robot you personally built?

Speaker 2

I think it's probably one of the first pick-and-place robots at Covariant.

Speaker 1

You didn't build them when you were a kid, then.

Speaker 2

I didn't build them when I was a kid.

Speaker 1

What made you get into robots?

Speaker 2

I would say I'm an AI person first and a robot person second, and a big part of my interest in robotics is probably driven by my interest in AI. We have just not made as much progress on AI in the physical world as on AI in the digital world, and to a large degree, I think we have to make progress there, because ultimately we live in the physical world. You're creating all this intelligence and these amazing things in the digital world, and that is all great, but where's AI in the physical world? There's remarkably little progress there, despite how much AI has moved forward. And so to some degree, it's driven by a conviction that AI has to progress forward, and AI will have a large impact in the physical world.

Speaker 1

What do you understand about robots that most people don't?

Speaker 2

I think the most interesting thing about robots, that I would say we understand at Covariant and maybe people outside the company don't, is that making one robot work is obviously hard and fun, but making a lot of robots work at scale for a lot of customers takes a lot of operational discipline. It's about doing many, many things right. Before robots go into a facility, how should you prep the site so the robots actually work? Shipping robots at scale is a competency that requires a lot of operational excellence. And that is something that, I would say, most people don't think about. When most people think about robots, they think about this sexy, interesting technology; they don't think about having to nail a thousand small steps well in order for robots to actually have an impact in the world at scale.

Speaker 1

That's the leap from the academic lab to being a real company selling real products in the world. Great. Anything else you want to talk about?

Speaker 2

No. I enjoyed this conversation.

Speaker 1

Yeah, likewise. Thank you. Peter Chen is the co-founder and CEO of Covariant. Today's show was produced by Edith Russolo and Gabriel Hunter Chang. It was edited by Karen Chakerji and engineered by Sarah Bugueer. You can email us at Pushkin dot FM. I'm Jacob Goldstein, and we'll be back next week with another episode of What's Your Problem.
