Pushkin. Humans are made of proteins. Proteins are key components of our cells and of our muscles. Proteins regulate gene expression and the immune system. And yet, for the longest time, we had no idea what most proteins look like, and this was
a problem. Every protein has a different shape, and to understand how any particular protein works, how it interacts with other molecules, how it keeps us healthy or causes disease, it is very helpful to understand what that particular shape is. But determining that complicated three-dimensional shape was really hard. Scientists sometimes spent years trying to determine the shape of
a single protein. So people started to dream. They thought: what if we could come up with some kind of a system, some way to use a protein's sequence of amino acids to reliably predict that protein's unique three-dimensional shape? It would be a huge leap forward, one that could lead to a much deeper understanding of biology and a new wave of treatments for disease. This idea was called the
protein folding problem. But after decades of work, the best protein folding models were nowhere near good enough to be scientifically useful. And then in twenty twenty a group of researchers built an AI model to try to tackle the protein folding problem, and their model was so much better than what had come before that some people thought they were cheating. In fact, they were not cheating. They had
solved the protein folding problem. I'm Jacob Goldstein, and this is What's Your Problem, the show where I talk to people who are trying to make technological progress. My guest today is Pushmeet Kohli. He's vice president of research at DeepMind, an AI research group that's part of Google. Pushmeet was part of the DeepMind team that
solved the protein folding problem. They built an AI model called AlphaFold, and AlphaFold is one of the most impressive real-world AI success stories that we have seen so far. And as you'll hear in our conversation, AlphaFold holds lessons for AI that go beyond protein folding. One other thing you may hear in our conversation, by the way, is the occasional background beep from Pushmeet's smoke detector. Have you failed to change the battery in
your smoke detector? Is that perhaps what that little chirp was? You really should do that for yourself as well.
I know. I mean, I was trying to fix it before the recording, and it's just so finicky, it doesn't come out.
Okay, fair, fair, we'll live with it. It makes it real. And so it is the case with protein folding that it's not just people publishing papers, right? There was actually this contest that was held, what, every couple of years, of people trying to solve the protein folding problem? And this had been going on for a long time. And so is there a first moment when you compete in this contest?
Yeah. So this is an amazing thing about protein structure prediction and how visionary the community was. They had set up this amazing, foolproof way to evaluate progress, because it's very easy sometimes for scientists to fool themselves, saying, oh, that's progress, when there is no progress, right? So they had set up this contest in a very remarkable way, where they had said, well, for a specific duration of time, any scientists all over the world who were discovering new structures would not share them with the world. Instead, they would basically send them to a secret vault.
I love a secret vault.
Yeah, yeah, I mean, not, like, secret.
I want it to be a big vault with a wheel you turn, but yeah, I understand.
Yeah. And so the idea there was that nobody except the scientists who discovered the structure knew what the structure of this protein was.
So it's a perfect way to test these models doing the prediction, because somebody knows the answer, but the people building the models, people like you, people like DeepMind, don't actually know the answer. So you can't cheat; you can't backfit it or anything like that.

Exactly, right. So you don't know how good you are, because you've been training on known examples and you've been evaluating on known examples. But when you are tested, you are tested on these amazingly new things that nobody has seen before.
Yeah. Yeah, so okay. And they actually have a numeric score that they assign to everybody's model, right? It's very quantitative, and it's not just good or pretty good. It's a number, right? And what is it, zero to one hundred? Is that the scale?

Yeah, it's zero to one hundred; that's the scale. And if you look at progress in the twenty years before AlphaFold one was launched, it was somewhere between twenty five and forty GDT.
Twenty five to forty. Was it getting better slowly, or was it just kind of stuck in the thirties, more or less?
Yeah, it was stagnating. Sometimes it would go to thirty eight, sometimes thirty five, so it was going up and down, up and down. There was no remarkable breakthrough.
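For the curious, the GDT score they're discussing can be sketched in a few lines. Below is a minimal, simplified version of a GDT_TS-style calculation. It assumes the predicted and experimental structures are already superimposed and matched residue by residue; the real CASP metric also optimizes the superposition, which is omitted here.

```python
# Simplified GDT_TS-style score on a 0-100 scale. Inputs are (N, 3)
# arrays of C-alpha coordinates for the same N residues, assumed to be
# already superimposed; the full CASP metric also searches over
# superpositions, which this sketch skips.
import numpy as np

def gdt_ts(predicted: np.ndarray, experimental: np.ndarray) -> float:
    distances = np.linalg.norm(predicted - experimental, axis=1)
    thresholds = (1.0, 2.0, 4.0, 8.0)  # distance cutoffs in Angstroms
    # Fraction of residues whose prediction lands within each cutoff,
    # averaged across cutoffs and scaled to 0-100.
    fractions = [float(np.mean(distances <= t)) for t in thresholds]
    return 100.0 * sum(fractions) / len(fractions)

rng = np.random.default_rng(0)
truth = rng.normal(size=(100, 3))
print(gdt_ts(truth, truth))          # 100.0: a perfect prediction
offset = np.array([5.0, 0.0, 0.0])   # every atom off by 5 Angstroms
print(gdt_ts(truth + offset, truth)) # 25.0: only the 8 A cutoff passes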
And was AI the only way people were trying to solve it? Or were there whole other sorts of things people were thinking about trying?
Yeah, so these were mostly not AI-based solutions, right? These were very well designed, hand-designed systems that were carefully tuned to the problem over many, many decades, with large teams working together and so on. There was very little machine learning.
So they'd been scoring in the thirties, more or less. And then what year is it that you and DeepMind show up with AlphaFold one?
So twenty eighteen. The contest actually runs in the summer.

And then at the end of the summer they give you the results, or what?
At the end of the summer, I mean, by, I think, July or August, they have sent the last files and you have sent them the results, and then you wait, right? And you don't know what has happened. Then they invite you to a conference, which happens in December, so you are eagerly waiting: what has happened? You're like, oh, maybe we came last, maybe we're in the middle. And then they actually reveal the leaderboard, the scores, at the conference.
Where were you at the time?
I was in London, in the office. I was really waiting, trying to figure out, where were we in terms of the competition? How did we perform? And we get an email from the organizers one day, before the results are announced, and they say, well, you are first, and by a big margin. From thirty to forty, we had gone to more than sixty.
You did way better at predicting the structure of proteins than anyone had ever done.

You won by a lot? Yes, we won by a lot.
So that's way better than anyone has ever done. But does it mean that you're basically half right? Is that what that number means?
Yeah, so, I mean, you're the best in the world, but still your predictions are pretty much not very useful. If you're trying to figure out, will a drug bind to this particular protein, the error is so large that you wouldn't get a complete picture of the protein.
So this sixty number, it's good in that it's better than anyone has ever done; it's bad in that it's not scientifically useful. What number do you have to get to to be scientifically useful?
Between eighty five and ninety. That's what people told us: if you get beyond eighty five to ninety, then the problem is solved.
So what do you decide when you get this result?
So we get this result and we're like, yeah, this is amazing, right, that we are the best in the world by a big margin. So the thesis that machine learning will advance science, that's great. But the problem is not solved. So let's go back to the drawing board. Now, with the information that we have and the amount of time we have spent on this previous architecture, do we still think that this will lead us to where we want to go? And the team thought, no. Yeah, we need to completely start from scratch.
So your reaction to winning this contest and doing better than anyone has ever done at predicting protein folding is, let's blow up this thing that just won the contest.
Yeah, throw it away. The basic premise was proved, that machine learning has a role to play, right? So that gave us a lot of confidence. But at the same time we thought, well, this is not an elegant solution, right? This is like two modules: there's a machine learning module, it is making these predictions, and this other module is then trying to use them. If you believe in the power of machine learning, let's do end to end, right? Let's do end to end and basically do everything so that the model takes care of it.

So just to be clear, what was happening with that initial model that you were deciding to abandon?

That was basically using the machine learning model together with a known framework, right? There was a second step that was a conventional step.
Oh, I see. So it's like you weren't all in on machine learning. You were like, well, we're gonna use machine learning, but we're still gonna do this kind of the way other people have been doing it. And your response to that first result was, screw it, let's not do anything like anybody has done before. Let's just go all in on machine learning, beginning to end. Exactly. Interesting. And so you do that, and you spend, what, two years? Is it two years between? Do I have that right?
Yeah?
Yeah, and then you come back in twenty twenty. You come back in twenty twenty, there's another one of these contests. Yeah, you got your new end end machine learning model.
Yeah. So it was the pandemic. So this is twenty twenty, right.
So nobody's gone anywhere.
Yeah, nobody's going anywhere, right. Twenty nineteen was basically when we started working with this new model, and it was really tough going, because we were starting from twenty, right? So we had gotten to sixty, and now we are starting from twenty, then thirty, forty. And sometimes you would stagnate at forty five, fifty, and you were like, really, should I have kept that first model?
Yeah?
Yeah. So by the end of twenty nineteen we started getting some really, really cool results, and we thought, okay, now we have definitely surpassed the previous model; we're in good territory. And we were very excited. At the start of twenty twenty, we were like, yeah, making progress. And then the pandemic hit.
In a minute, the model gets an unanticipated test in the real world.
There was this new virus that was reported, SARS-CoV-2, and one of the first things, somebody figured out the structure of the spike protein. It was all over the newspapers: here's the spike protein of this new virus. But all the other proteins of the virus, the accessory proteins, nobody knew their structure. So the first thing we did, we thought, we think we have the best model in the world; we should be making these predictions and sharing them with the world. But is this the right thing to do? So we spent a lot of time reaching out to biologists, who looked at the predictions and said, well, you need to share this, you need to share this with the world. So the start of twenty twenty was us sharing the predictions from this untested model with the world, because we thought they were quite good. And then throughout twenty twenty we took part in the assessment, right, which ran in the summer of twenty twenty, the contest.
In the contest.
Exactly. Normally, right, the organizers don't come back to you. They just release the results at the end, in December. And at the end of the summer we get this funny email saying, we want to talk to you. And so we were like, yeah, did we do anything bad? What happened? And a few of them really had suspicions. They were like, you must have cheated, right? Your level of performance is nowhere close to anything that we have seen ever, right? But a few scientists in that contest had submitted a sequence, a protein whose structure was not known. They were expecting that the structure would be known by the time the contest ended, so they'd be able to evaluate the predictions. But that structure was not known; in fact, they couldn't find the structure.
So you're saying it would be impossible to cheat because literally no human knew the structure. No way to cheat, nobody knows the answer.
Yeah, yeah. So they used the prediction from AlphaFold and then tried to explain their experimental data, and it matched. And they were like, this model has been able to discover something that nobody knew, that no scientist knew. In a sense, the model had already made new biological discoveries even before we knew it.
Yeah, yeah, okay, so that's good. You're not in trouble anymore. It's clear you didn't cheat. Do they say the number? What's the number? I'm waiting for the number. How'd you do?
Yeah, so we were beyond eighty five to ninety, right. And then they basically said, okay, we have to announce it to the world. And so come December, the announcement was made by the organizers that AlphaFold two had solved the protein structure prediction problem.
So is that contest done now? Did you just end that contest? Is nobody doing that anymore?
No, the contest is alive, right? Its focus has changed. What AlphaFold two did was find the structure of these single proteins. But there are many other problems that remain, right? How do multiple proteins interact, for instance? So there are other structure prediction problems that the contest has evolved to; it is focusing on other types of problems that AlphaFold two did not address.
If we zoom out and think about what you have done, what the team has done, in using machine learning to solve this scientific problem that people had been working on for a long time, what are the broader lessons? If we think about other domains, what can we infer, what can we take from this?
Yeah, so I think the thing that we can take from this is that science is generating a lot of data. Across any domain that you see, whether it's genomics or high-energy physics, the amount of data that we are gathering about the world is much more than any single human mind can comprehend.
Right.
You can have the best scientists and they will not be able to go through all the data that we are collecting about our world. So machine learning is this remarkable tool which gives us the ability to make sense of and leverage this data, and it really puts us on the path of accelerating our understanding of the problems that we're dealing with.
In the case of AlphaFold, was the input data the known protein structures and amino acid sequences? Was that the basic training data?

Exactly, right.
So it was the PDB, the Protein Data Bank, which had been collected by the community over many, many decades. They have meticulously, carefully deposited all the protein sequences and the corresponding structures that were discovered, right? And it had one hundred and fifty thousand examples at that time, sequences as well as structures, and everyone had access to the same data. All the teams were training on that data.
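To make that concrete, here is a rough sketch of what one of those training examples pairs together: an amino acid sequence and its experimentally determined coordinates. This is an illustrative toy, not AlphaFold's actual data pipeline (which also builds multiple sequence alignments and does heavy preprocessing); the helper function is our own, and it simply pulls one entry from the public RCSB archive and extracts the C-alpha trace of one chain.

```python
# A toy look at one sequence-structure training pair from the PDB.
# Illustrative only; AlphaFold's real pipeline is far more involved.
import urllib.request

THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

def fetch_example(pdb_id: str, chain: str = "A"):
    """Return (sequence, C-alpha coordinates) for one chain of a PDB entry."""
    url = f"https://files.rcsb.org/download/{pdb_id}.pdb"
    text = urllib.request.urlopen(url).read().decode()
    sequence, coords = [], []
    for line in text.splitlines():
        # PDB is a fixed-column format: atom name in cols 13-16, alternate
        # location in 17, residue name in 18-20, chain in 22, x/y/z in 31-54.
        if (line.startswith("ATOM") and line[12:16].strip() == "CA"
                and line[16] in (" ", "A") and line[21] == chain):
            sequence.append(THREE_TO_ONE.get(line[17:20].strip(), "X"))
            coords.append((float(line[30:38]),
                           float(line[38:46]),
                           float(line[46:54])))
    return "".join(sequence), coords

seq, xyz = fetch_example("1CRN")  # crambin, a small, well-studied protein
print(len(seq), "residues:", seq)
```

The point is just the pairing: the sequence is the model's input, and the coordinates are the answer it learns to predict.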
Is it right that AlphaFold itself is open sourced, and that there's this open source database of protein structures that have been discovered with AlphaFold? Is that right?
Yeah. So when we developed AlphaFold, we made it available to the world. But we then said, well, it's so accurate, and it's also so fast, that we will use it to find the structure of every known protein. And then we made all those structures available to the world.
AlphaFold has now made the structures of roughly two hundred and fifty million different proteins publicly available. We'll be back in a minute with the lightning round.

Last thing is a lightning round: just some fast questions, okay, and then we'll be done. What's your favorite protein?
Hemoglobin.
Why?
It is very pleasant to look at. It is very symmetric, and you can see its purpose, right, where the oxygen binds into it. A very clean protein.
It's so easy to understand. It's the little thing that carries oxygen around your body. If everything goes well, what problem will you be trying to solve in, say, five years?
Really thinking about the two big challenges that humanity is facing: one is pandemics, the other is climate change. And I think material science and quantum chemistry can impact both, but especially climate change. And I think this is something that requires a lot of work.
Is there some particular problem in that domain that is analogous to protein folding? Is there some hard thing that you want to figure out?

Rational material design. We are very far from there. We are still basically doing experimental stuff when we think about discovering new materials.
What do you understand about AI or machine learning that most people don't understand?
I think AI is not magic, right? Essentially, it's a series of techniques which is able to extract intelligence, but you extract intelligence from the raw material, right? So: garbage in, garbage out. What is really important is that the experience needs to be rich enough. We don't become intelligent by sitting in a room, right? We become intelligent because we have amazing experiences. So it's not big data, right, it's not the bigness of the experience, but the goodness of the experience, the wide variety of things that you train on and the things that you see. So I think that's really important.
That thought leads you to, like, the optimal training data. So is the worry that people are making a mistake by just doing a lot of the same kind of training data?
Yeah, exactly, exactly right. If you just take one example and repeat it multiple times, that's not great. Again, you don't become wise doing the same thing again and again and again.
Right. What are you actually working on right now? Like, what are you going to go work on today or next week?
So there is a system that my team developed called SynthID, which is a system for watermarking AI-generated content. We want to be able to detect it: when you have AI-generated content, users should be able to detect that it is AI-generated.
AI-generated content, whether it's images or words or whatever, text, video.
Exactly, exactly. You embed this imperceptible thing within the thing that is generated, something a human might not see.
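The conversation doesn't spell out how SynthID's text watermark works, but the general idea of a statistical text watermark can be sketched. The toy below is not DeepMind's algorithm; the key, the vocabulary, and the stand-in "model" are all made up for illustration. A secret key pseudo-randomly splits the vocabulary at each step into a "green" half that generation slightly prefers; a detector holding the key counts green tokens, and watermarked text scores far above the roughly fifty percent a human writer would hit.

```python
# Toy statistical text watermark (illustrative only; not SynthID).
import hashlib
import random

SECRET_KEY = "demo-key"  # hypothetical shared secret

def green_set(prev_token: str, vocab: list[str]) -> set[str]:
    # Key the vocabulary split on the previous token so it varies by position.
    seed = hashlib.sha256((SECRET_KEY + prev_token).encode()).hexdigest()
    rng = random.Random(seed)
    shuffled = vocab[:]
    rng.shuffle(shuffled)
    return set(shuffled[: len(vocab) // 2])

def generate(vocab: list[str], length: int, bias: float = 0.9) -> list[str]:
    # Stand-in for a language model: with probability `bias`, sample from
    # the green half; otherwise sample from the whole vocabulary.
    rng = random.Random(0)
    tokens, prev = [], "<s>"
    for _ in range(length):
        pool = sorted(green_set(prev, vocab)) if rng.random() < bias else vocab
        prev = rng.choice(pool)
        tokens.append(prev)
    return tokens

def green_fraction(tokens: list[str], vocab: list[str]) -> float:
    # Detector: with the key, count how often each token was "green".
    hits, prev = 0, "<s>"
    for tok in tokens:
        hits += tok in green_set(prev, vocab)
        prev = tok
    return hits / len(tokens)

vocab = [f"w{i}" for i in range(100)]
marked = generate(vocab, length=500)
rng = random.Random(1)
unmarked = [rng.choice(vocab) for _ in range(500)]
print(f"watermarked: {green_fraction(marked, vocab):.2f}")   # ~0.95
print(f"unmarked:    {green_fraction(unmarked, vocab):.2f}")  # ~0.50
```

A reader without the key sees nothing unusual in the text itself, which is what makes this kind of mark imperceptible.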
So the builder of an AI model, OpenAI say, could choose to embed a watermark in GPT, so that anybody who made a thing with GPT, that document would have some hidden sign that it was AI-generated. It's sort of the choice of the model builders. Yeah. Thank you very much for your time. It was great to talk with you.
Yeah, thank you. It was a pleasure.
Pushmeet Kohli is vice president of research at Google DeepMind. Today's show was produced by Edith Russello and edited by Karen Chakerje. You can email us at problem at Pushkin dot FM. I'm Jacob Goldstein.