Strachey Lecture: Artificial Intelligence and the Future

00:13

My name is Mike Wooldridge. I'm a professor of computer science and currently head of the Department of Computer Science at the University of Oxford. And I would like to welcome you all to this term's Strachey Lecture. The Strachey Lectures are the distinguished lectures in computer science that the Department of Computer Science offers. We do not usually host our Strachey Lectures in the Sheldonian Theatre.

00:36

The fact that we are able to do this today is because of the generous support of Oxford Asset Management. We are very, very grateful to Oxford Asset Management, who literally have made this event possible on a scale and type that would not have been possible otherwise. So we thank them very much for that and for their continuing, ongoing support. Let me say a few words about today's speaker.

01:02

It's an enormous pleasure to be able to welcome Demis Hassabis of Google DeepMind to be this term's Strachey lecturer. Demis gained his undergraduate degree from Cambridge in computer science. He then went on and was for some time a successful computer games programmer, and designed a number of games which went on to achieve a degree of success in the marketplace. He did a PhD at UCL in cognitive neuroscience, and then in 2009 he was a co-founder of a company called DeepMind.

01:37

Well, all through that time I think it's probably fair to say that he didn't hit the front pages of any newspapers.

01:44

All of that changed in 2014 when DeepMind were acquired by Google for the very un-British sum of, I believe, 400 million, which is not a figure that happens that much in any part of British industry, and got him and DeepMind on the front pages of the international press. And computer science professionals like myself were all agog to see what this company, which I think it's probably fair to say had been operating in stealth mode or something like it for a number of years,

02:14

had to offer. Well, we didn't have to wait very long. Very soon after that, the first results became public from DeepMind, and it became clear what Google were interested in. I'm not going to spoil Demis's show by telling you about those results now, but they were some very impressive results to do with learning to play video games. And being on the front pages of the national press just once or twice was not enough for them.

02:39

They got on the front pages of the international press last year with some incredible achievements in the area of computer programs playing the game of Go. And we all now have the opportunity to see Demis speak at a really special time for DeepMind because, as Demis is about to tell us, they are heading up towards a competition to play Go with some of the world's leading players.

03:02

So we're going to get an insight into a company that's doing remarkable things, at one of the most remarkable points in that trajectory. So it's with very great pleasure that I introduce Demis and welcome him to give this term's Strachey Lecture. Over to you. Thank you. Good evening, ladies and gentlemen. Welcome to the Sheldonian Theatre. Before the lecture begins, would you ensure your mobile phones are switched off?

03:32

I would also remind you that unauthorised photography and recording are prohibited. In the interest of safety, would you ensure emergency exits, walkways and window ledges are kept clear of personal belongings. Guests in the upper gallery are asked to leave using the stairs. [inaudible] Thank you. Okay. Well, thanks to the voice from the sky. So thanks, Mike, for that very generous introduction. It's a huge honour and real pleasure to be giving this lecture.

04:03

And to be invited to give this lecture in these auspicious surroundings. So what I'm going to try and do today in my talk is to give you a whirlwind tour of what's happening at the cutting edge of artificial intelligence and then end with some of the latest breakthroughs that we've been making at DeepMind. And then I'll probably talk a little bit about the bigger picture of artificial intelligence and where I think it's heading in the future.

04:27

And then we can go into the Q&A. So artificial intelligence, AI, is basically the science of making machines smart. Now, DeepMind was founded in 2010 and, as Mike mentioned, we joined Google in 2014 to accelerate our mission. And the way we think about DeepMind is as a sort of Apollo programme effort for AI. We have about 200 research scientists and engineers now.

04:56

So I think it's probably one of the biggest collections anywhere in the world of talent focusing around this topic. And not only is this a very ambitious sort of research program, but we also try and think about a new, more efficient and productive way of organising science and scientific research. And in terms of like the environment we've created, we've tried to sort of build a unique environment that's a blend between the best of academia and how academia should function in an ideal world.

05:28

And the best from the top sort of Silicon Valley start-ups. So kind of blue sky thinking from academia and collaborative interdisciplinary research, and then the focus and the energy and buzz and resources that really, really successful start-ups have. We try to fuse this together into a unique environment that's uniquely suited to research. So our mission at DeepMind we basically articulate, or at least I do, in this way.

05:58

So step one, fundamentally solve intelligence, and then step two, use that to solve everything else. So, you know, this may seem quite fantastical to you, this step two, but actually I hope that by the end of this talk you'll be convinced that it actually naturally follows on from solving step one. So more prosaically, how are we going to attempt to do this? Well, at DeepMind, what we're interested in doing is building what we call general purpose learning algorithms.

06:29

So the key things of everything that we do is that our algorithms learn how to master certain tasks. They learn automatically from raw inputs or raw data. They're not pre-programmed or handcrafted in any way. The second important notion that we have is this idea of generality. So this is the sort of the idea that the same system or same set of algorithms can operate out of the box across a wide range of environments and tasks.

07:01

So we call this kind of AI internally at DeepMind Artificial General Intelligence, AGI, and the hallmark of AGI is that from the ground up, it's built to be flexible, adaptive and inventive. It can deal gracefully with the unexpected. Now, if we compare that with most AI that's out there today, which we term narrow AI to distinguish it from AGI, most of the AI that you interact with every day is handcrafted and special-cased to a particular single task.

07:35

And what that often means is that these systems are quite brittle. If you do something unexpected, or something unexpected happens that the programmers of that system didn't cater for, then it will catastrophically fail. And you can see that with things like Siri on your phone. You know, it works fine if you stick to the templates that have been pre-programmed. But as soon as you start going off-piste with your conversation, the holes in the algorithms quickly become apparent.

08:04

So still today, probably one of the greatest achievements in AI was Deep Blue beating Garry Kasparov at chess in the late nineties. Of course, this was a huge technical achievement and an absolute watershed moment for AI research. But having said that, you know, the question is, was Deep Blue truly intelligent? And I think even the designers of Deep Blue, and certainly we, would argue that it isn't really.

08:30

And one easy way to see that intuitively is the fact that Deep Blue couldn't even play a strictly much simpler game, like noughts and crosses, without being totally reprogrammed from scratch. There was no knowledge in the algorithms that Deep Blue was running that would help it play any other game, let alone do anything else. So I actually came away. I remember this match very distinctly.

08:52

I was studying at Cambridge and I actually came away more impressed by Garry Kasparov's mind than the computer, because here was Garry Kasparov able to compete on more or less level terms with this brute of a machine. And yet, of course, Garry can do many other things: speak several languages, drive cars, tie shoelaces. So, you know, in a way, it's quite amazing what the human mind is capable of.

09:19

So instead of this kind of regime, how do we think about artificial intelligence? Well, I would say the core of what we're doing at DeepMind is focused around what's called reinforcement learning. And that's how we think about intelligence at DeepMind. So I'm just quickly going to explain, with the help of a simple little diagram here, what reinforcement learning is.

09:39

So if we start off with the agent system, the AI, the agent system finds itself in some kind of environment trying to achieve a goal. Now, that environment could be a real world environment, in which case the agent would be a robot, or it can be a virtual environment, in which case the agent would be an avatar. And in fact, for most of our research, as you'll see, we use virtual environments. Now the agent interacts with the environment in just two ways.

10:05

Firstly, it gets observations through its sensory apparatus. We mostly use vision currently, but we're starting to think about other modalities. And one of the jobs of the agent system is to build the best possible model of the world out there, the environment out there, just based on these incomplete and noisy observations that it's receiving in real time. And in real time, it's got to keep updating that model in the face of new evidence.

10:34

The second job of the agent, once it's built this model of the world, is to use that model to make predictions about what's going to happen next. And if you can make predictions about the world, then you can start planning what to do. So if you're trying to achieve a goal, the agent will have a set of actions available to it at that moment. And the decision making problem is to pick which action will be the best action to take right now to get you towards your goal.

11:01

And once the agent has decided that, based on its model and its planning trajectories, it executes the action, and that action may or may not make some change to the environment. And then that drives a new observation. And that's really it. That's the heart of reinforcement learning. But although this diagram is very simple, those of you who know about reinforcement learning will understand there's huge complexity hidden behind it.
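The observe, model, predict, act loop described above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the coin-guessing environment, the counting agent, and every name below are hypothetical and are not DeepMind's code. In this toy, the action does not change the environment, which is one of the cases the talk allows for.

```python
import random

class CoinFlipEnv:
    """Hypothetical toy environment: the agent guesses the next flip of a biased coin."""
    def __init__(self, p_heads=0.8, seed=0):
        self.rng = random.Random(seed)
        self.p_heads = p_heads

    def step(self, action):
        # The environment produces an outcome; reward 1.0 if the guess was right.
        outcome = 1 if self.rng.random() < self.p_heads else 0
        reward = 1.0 if action == outcome else 0.0
        return outcome, reward  # (observation, reward)

class CountingAgent:
    """Keeps a tiny model of the world (outcome counts) and acts on its prediction."""
    def __init__(self):
        self.counts = [1, 1]  # how often each outcome has been observed

    def observe(self, observation):
        # Update the model in the face of new evidence.
        self.counts[observation] += 1

    def act(self):
        # Predict the more frequent outcome and choose the matching action.
        return 0 if self.counts[0] > self.counts[1] else 1

def run(steps=1000):
    env, agent = CoinFlipEnv(), CountingAgent()
    total_reward = 0.0
    for _ in range(steps):
        action = agent.act()                    # decide using the current model
        observation, reward = env.step(action)  # environment responds
        agent.observe(observation)              # new observation updates the model
        total_reward += reward
    return total_reward / steps

average_reward = run()
```

The agent is never told the coin's bias; it only sees observations and rewards, which is the sense in which the learning is "from raw inputs".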

11:27

But we do know that if we could solve all the issues behind reinforcement learning and make this work perfectly, then that would be sufficient for general, human-level intelligence. And the reason we know that is because biological systems learn using reinforcement learning, including the human brain.

11:45

In fact, there were some seminal studies done in the late nineties on monkeys that showed that the dopamine neurones in the brain implement a form of reinforcement learning called TD learning. So reinforcement learning is at the core of what we do at the moment. The second big philosophical thing we committed to at the founding of DeepMind was this idea of grounded cognition.

12:12

So this is the idea that a true thinking machine has to be grounded in a rich sensory motor reality or data stream. Now when people commit to this sort of sentiment, often they then start working on real robots because after all, real robots are actually situated in the real world. And of course, through their sensory apparatus, they're getting data, real world data.

12:37

But we actually made a different decision on this. We decided to use virtual environments and games, and we think they're the perfect platform, if used correctly, for developing and testing AI algorithms. One of the important things you have to avoid is that when you use virtual environments, of course, if you wanted to, you could allow your agent to have access to all kinds of internal states of the game that it couldn't actually directly sense through its normal sensory apparatus.

13:05

And of course, that's something you have to avoid. Otherwise you'll think that you're making progress with your algorithms, but actually it would be cheating in some way. So you have to be very disciplined about the interface you allow between the virtual environment and the agent, and really treat the agent as if it was a virtual robot, only getting the information that could be available to it through its sensors.

13:31

Now if you use games like that, then there are many advantages. Of course you can create as much training data as you like. This was very important when we were a small, independent company and we didn't have access to a lot of data, but it's still vital now, even though we're at Google. There's no testing bias.

13:48

One of the biggest things, I think, that held back the AI research field was that often you'll find the researchers are also the ones that are creating the tests, and that can lead to unconscious sort of biases about the sorts of tests that you design. You end up designing tests that, subconsciously at least, your algorithms are well suited to.

14:14

If we're talking about virtual agents in virtual environments, we can test thousands, perhaps even millions of these agent systems in parallel. And games are very convenient in that a lot of them have scores or quite easily identifiable goals. So it's very easy to measure incremental progress and how your algorithms are doing when you incrementally improve them. And that's very key for us. Actually, benchmarking is a hugely important thing for us.

14:43

We have a whole team who works on that, because when you've got a very ambitious long term goal, it's even more important to have short term directional sort of waypoints that tell you if you're heading in the right direction towards this big, ambitious, long term goal. So putting all this together, then, this brings us to the notion of what we call end-to-end learning agents.

15:08

So this is the idea of going all the way from the pixels or the raw data input and ending up with making a decision about what action to take. And our algorithms should cover that entire stack of problems, so everything from perceptual processing to decision making and all the things in between. So our first attempt at doing this, which really scaled to something challenging, we call deep reinforcement learning.

15:37

And the essence here was combining deep neural networks, which is called deep learning these days, with reinforcement learning. And what this allows reinforcement learning to do is to actually scale up to work on very challenging problems. Until we sort of came up with this paradigm, reinforcement learning had been around for many decades, but had usually only been used for relatively toy grid-world problems. It had been hard to scale it up to something challenging with high dimensional sensory inputs.
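To make the "toy grid-world" end of that spectrum concrete, here is a minimal sketch of tabular Q-learning on a five-state corridor. It is an illustration under assumptions, not the system described in the talk: the corridor, the reward of 1 at the right-hand end, and all parameter values are hypothetical. The relevance to deep reinforcement learning is that the lookup table `q` below is what becomes infeasible with high-dimensional inputs such as raw pixels, and a deep neural network is used in its place to estimate the same action values.

```python
import random

def q_learning_chain(n_states=5, episodes=500, alpha=0.5, gamma=0.9,
                     epsilon=0.1, seed=0):
    """Tabular Q-learning on a corridor; reaching the rightmost state gives reward 1."""
    rng = random.Random(seed)
    goal = n_states - 1
    q = [[0.0, 0.0] for _ in range(n_states)]  # q[state][action]; 0 = left, 1 = right
    for _ in range(episodes):
        state = 0
        while state != goal:
            # Epsilon-greedy action choice, breaking ties at random.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state = max(0, state - 1) if action == 0 else state + 1
            reward = 1.0 if next_state == goal else 0.0
            # Bootstrapped target: reward plus discounted best value of the next state.
            future = 0.0 if next_state == goal else gamma * max(q[next_state])
            q[state][action] += alpha * (reward + future - q[state][action])
            state = next_state
    return q

q_table = q_learning_chain()
greedy_policy = [0 if left > right else 1 for left, right in q_table[:-1]]
```

After training, `greedy_policy` prefers moving right in every non-goal state, which is the shortest route to the reward.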

16:08

So I'm going to show you a few videos of this agent working. But before I do, I just want to clearly explain what it is you're going to see. So we started off with really the first iconic console, the Atari 2600 from the eighties. This has the benefit of there are hundreds of different classic games, many of which are iconic and everyone will recognise. But it's still quite a challenging sensory data stream. The agents here, they only get the raw pixels from the screen as inputs.

16:40

So that's around 30,000 numbers per frame, because the screen is about 200 by 150 pixels in size. And the goal here is simply to maximise the score. The agent system has to learn everything else from scratch, from first principles. It doesn't know what it's controlling. It doesn't know what the object of the game is. It doesn't know what it gets points for. It doesn't even know that pixels next to each other are correlated in time. It has to find all this structure for itself.
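The "30,000 numbers per frame" figure is just the product of the approximate screen dimensions quoted above. A small sketch, assuming one number per pixel (the `channels` parameter and the helper names are hypothetical, added only for illustration):

```python
# Frame dimensions as quoted in the talk (approximate figures).
FRAME_WIDTH = 200
FRAME_HEIGHT = 150

def inputs_per_frame(width, height, channels=1):
    """Number of raw input values the agent receives for one frame."""
    return width * height * channels

def blank_frame(width, height):
    """A frame as a list of rows of pixel intensities."""
    return [[0] * width for _ in range(height)]

def flatten(frame):
    """Turn the 2-D frame into the flat vector of numbers an agent would be fed."""
    return [pixel for row in frame for pixel in row]

n_inputs = inputs_per_frame(FRAME_WIDTH, FRAME_HEIGHT)
flat = flatten(blank_frame(FRAME_WIDTH, FRAME_HEIGHT))
```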

17:11

And then there's an additional constraint or requirement we put on the system, which is this idea of generality again: that a single system has to play all the different games without any changes, with the same hyperparameter settings and other settings. So I'll show you a couple of videos. The first one to show you is Space Invaders, which has two parts to it. One where the agent has had no training.

17:38

So literally the first time it's seen the data stream. And then after a day or two's worth of training. So initially, you see, it's controlling the green rocket at the bottom of the screen. It's losing its three lives immediately because obviously it has no idea at the moment what it's supposed to be doing, or even the fact that it's controlling that collection of pixels at the bottom of the screen.

17:59

Now after training by playing the game overnight for 24 hours, you come back and now the system is at superhuman level. So it can play Space Invaders better than any human can. So you see here, every single bullet hits something. It's learned that the pink mothership at the top of the screen, coming across there now, which it hits with this amazing shot, is worth the most points.

18:23

And as you'll see later, those of you who remember Space Invaders, if you're old enough to remember, the fewer of them there are, the faster they go. So just watch the last sort of predictive shot that it makes to get the last one. So, you know, it's built up these very accurate models, implicit models, of what's happening in this game just from this data stream. So I'll show you another video now. This is Breakout, my favourite video.

18:47

So here you control the bat and ball and you're trying to break through this rainbow coloured wall. Now, at the beginning, after 100 games, you can see the agent is not very good. It's missing the ball most of the time. But it's starting to get the hang of the idea that the bat should go towards the ball. Now, after 300 games, it's about as good as any human can play this. And it pretty much gets the ball back every time, even when it's coming back at very fast vertical angles.

19:13

We thought, well, that's pretty cool, but we left the system playing for another 200 games and it did this amazing thing. It found the optimal strategy was to dig a tunnel around the side and put the ball around the back of the wall. And you see how incredibly accurately, of course, it can send the ball around the back. So the funny thing about that is that obviously the researchers working on this are amazing AI developers and programmers,

19:38

but they're not so good at Breakout. And actually they didn't know about that strategy. So they learned something from their own system, which is, you know, pretty funny and quite instructive, I think, about the potential for general AI. I'll show a final video here of Atari, which is really showing a medley now of many different games, just to give you a feeling that this system, which we called DQN, really is a general AI within the constraints of Atari games.

20:08

So here is the same system you just saw playing those other games, playing an early racing game called Enduro. Here it's playing a game called River Raid, which is a fighter pilot game. That's one of the very early 3D games, called Battlezone. This is classic Pong; it's controlling the green bat here, and it wins 21-nil every time. You can't get a point off it. Seaquest, a submarine game. So you can see the absolute diversity of the graphics and also the objectives.

20:35

So here's boxing; it's controlling the red boxer on the left. It does a bit of sparring, then once it gets the computer opponent pinned at the side, it just racks up an infinite number of points. It's just happy to carry on doing that forever. So, you know, a very, very diverse range of games. Same system out of the box, mastering these things. Now, if you want to read more about that, that was featured in our Nature article at the beginning of last year. We also released the code.

21:04

So if you want to play around with this system yourself, you can; it's freely available on the Internet. And so we then sort of took that further. That was around a year ago. And now we've started looking at 3D games and at using simulators like robot simulators. And eventually we would like to start thinking about real robotics. Now I'm just going to quickly show you a couple of 3D videos with effectively the same deep reinforcement learning system with a few tweaks.

21:35

Now coping with this 3D data stream. So here is a DQN-like algorithm driving a racing car very fast around the track, again, just from raw pixel inputs. So that's how it's learnt, and it's driving around at about 200 kilometres an hour. And it figures out overtaking manoeuvres. It can also recover from spins, all sorts of things. So it has really amazing performance now in these kinds of driving games, just from the vision.

22:08

We've also started looking at harder problems in 3D mazes: collecting objects, finding your way out, remembering where you've got to go, the sort of things that maybe a rodent like a rat would be able to do. And so you can sort of think about where we're going next as trying to build a kind of rat level intelligence. And rats are actually pretty smart. They can do quite a lot of things.

22:32

And so here it is again, just from the visuals on the screen, just from the raw pixels, finding these green apples, which are rewarding, and then trying to find the exit, which is this little red floating object, and efficiently navigating around. So that's where we are on sort of 3D and there'll be big announcements about that later this year. So I've talked about sort of reinforcement learning.

22:59

I've talked about grounded cognition. Another thing that I think is sort of pretty unique to DeepMind's approach to AI is taking systems neuroscience seriously as a source of inspiration for new algorithmic ideas, but also as a kind of validation testing, if you like. So suppose you have your own pet favourite algorithm or algorithmic technique, and you're not sure if this can scale up to become a component of general AI.

23:28

How much effort should you put into that? You know, should you spend five years doing that? Ten years? How many people should be working on that? You know, these are very difficult decisions if you're running a lab or a department or a company working on this kind of thing. Now, if you can point to something in the brain, like with reinforcement learning, and as I said earlier, we know that the brain implements TD learning through the dopamine system,

23:53

that gives you confidence then that in the limit this has to be sufficient. So, for example, with reinforcement learning, it's not crazy to think about that as a component, a vital component, of the general AI solution. And that can be very important directionally when you're thinking about four or five year research programs. But when I say neuroscience, I should be very clear. We're thinking about systems neuroscience.

24:16

So we mean the algorithms, the representations and the architectures the brain uses, rather than something like the Human Brain Project, which is more interested in the low level synaptic implementation details of how the brain achieves things with spiking neural networks. That's too low level for us. We're more interested in the computational level of the brain. So we're using many ideas from neuroscience. I haven't got time to go into it today, but here are some of the things we're looking at.

24:42

Memory, attention, concepts, planning, navigation, imagination and all of these areas we're actively researching on right now and have very interesting prototypes on. And in fact, I'll just mention one thing. For my Ph.D., I studied an area of the brain called the hippocampus, and I studied memory and imagination in the human brain with stimuli.

25:02

And it turns out the hippocampus, which is shown here in pink, which is in the centre of your brain, is actually critical for many of these capabilities, especially things like memory, navigation and imagination. And the hippocampus has a very different structure to cortex. So it's quite interesting: when people talk about intelligence in the brain, they usually talk about the cerebral cortex.

25:24

But actually there are other structures in the brain that are equally critical to this whole question of intelligence. So now I'm going to talk a little bit about our newest work, AlphaGo. And the reason we took on this project and I'll explain a lot more about what this is in a second, is that AlphaGo really combines pattern recognition with planning. So what you've seen so far with the Atari games is really a kind of stimulus response system.

25:55

So, you know, it's very smart, but it's stimulus response. So it learns about how to process Atari screens and, generally speaking, what to do in that moment in terms of an action that will maximise its score. But there isn't a lot of long term planning. Now contrast that with a game like Go. So Go, for those of you that don't know how to play it or don't know what it is, this is a picture of a Go board.

26:23

So Go is the sort of pinnacle of board games. It is the most complex game pretty much ever devised by man that's played professionally. And the way it's played is that you play on a 19 by 19 grid and there are two sides, black and white, and you put down these pieces called stones, and the stones are placed on the vertices of the board, and black goes first and they take turns placing one stone at a time.

26:54

Once the stones are placed, they don't move. Now, the rules of Go are actually incredibly simple. I'm going to teach you how to play Go in two slides. But it leads to incredible, profound complexity. That's why it's considered to be one of the most, in fact the most, elegant games ever invented. Now a quick history of Go for those of you who don't know about it. It originated in China over 3000 years ago. And it has an incredibly rich tradition in Asia.

27:24

So in China, Japan and Korea and other Asian countries, this is what they play instead of chess. But in Asian countries, this is regarded as more than just a game. Go is sort of elevated to the status of poetry or art. In fact, Confucius wrote about Go and he considered it to be one of the four essential arts to be mastered by any true scholar. Japan also has a rich history around Go. And in fact, during the Edo period, sort of 250 years, 1600 to 1800, annual games were played.

28:00

They were called Castle Games, played in front of the Shogun. And what would happen is that each clan would send their top Go player to play in the castle game for the honour of the whole clan. And some real legendary players came out of this. There's one guy, Shusaku, who won 19 years in a row and has gone down in legend with the nickname The Invincible. So they were really absolute heroes in that period.

28:31

So it has this incredibly rich history intertwined with the culture of Japan. But it's not just an ancient game. Today, there are over 40 million active players. In many of these countries, like Korea, for example, it's taught as part of the school curriculum, and there are specialist Go schools. So if you show talent at Go at a young age, then you will go to these Go schools from about the age of ten instead of going to normal school.

29:01

So it's taken very, very seriously. Now, as I was going to say, and I'm going to show you in a second, there are just two rules, actually, for Go. But the complexity that arises out of these very basic rules is huge. One quick, easy way of illustrating the complexity is the fact that there are ten to the power 170 possible board configurations. And in fact, ten to the power 700 different possible games. And that's more than the number of atoms in the universe.
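A quick way to sanity-check the scale of the board-configuration figure is to compute a simple upper bound: each of the 361 points is empty, black, or white, giving 3^361 arrangements (not all of them legal). The sketch below does that arithmetic. The figure of roughly 10^80 atoms in the observable universe is a commonly quoted estimate that is assumed here for comparison; it is not stated in the talk.

```python
POINTS = 19 * 19                 # vertices on a full-size board
UPPER_BOUND = 3 ** POINTS        # each point: empty, black, or white
QUOTED_POSITIONS = 10 ** 170     # figure quoted in the talk
ATOMS_ESTIMATE = 10 ** 80        # assumed common estimate, for comparison only

def order_of_magnitude(n):
    """Largest k such that 10**k <= n, for a positive integer n."""
    return len(str(n)) - 1

upper_bound_magnitude = order_of_magnitude(UPPER_BOUND)
```

The upper bound comes out at about 10^172, so the quoted 10^170 sits just below it, as expected once illegal arrangements are excluded, and both dwarf the atom estimate.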

29:32

So there's no way that you can solve Go through exhaustive search, or even play Go well through exhaustive search. The search space is just too large for brute force. So how do you play Go? Well, rule one is called the capture rule. So here is a position from a game of Go. And we're just going to zoom in to the bottom right of this board so I can show you how the capture rule works.

29:58

So let's look at this little part of the board. Now, if you see that white stone there that's surrounded by the three black stones: the empty vertices adjacent to where your stone is are called liberties. Now, when you run out of liberties, then those stones are removed from the board. So here that white stone that's surrounded by the three black stones only has one liberty left, that empty vertex above it.

30:24

So if it's Black's move and Black were to play into that final empty vertex, taking away the last liberty, then the white stone would be captured. So now that white stone has no liberties and would be taken as a prisoner off the board. So that's the capture rule. And you can capture large groups of multiple stones like this, not just one at a time. The second rule is that a repeated board position is not allowed.
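The capture rule just described can be sketched in Python: find the connected group a stone belongs to, collect its liberties, and remove opposing groups left with none. This is a minimal illustration under assumptions: the board encoding (`.`, `B`, `W`), the 5 by 5 board, and all function names are hypothetical, and self-capture is not handled.

```python
def parse(rows):
    """Turn a list of strings into a mutable board ('.' empty, 'B' black, 'W' white)."""
    return [list(row) for row in rows]

def neighbours(r, c, size):
    """Orthogonally adjacent vertices that lie on the board."""
    candidates = ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1))
    return [(nr, nc) for nr, nc in candidates if 0 <= nr < size and 0 <= nc < size]

def group_and_liberties(board, r, c):
    """Flood-fill the group containing (r, c); return its stones and its liberties."""
    colour, size = board[r][c], len(board)
    group, liberties, stack = {(r, c)}, set(), [(r, c)]
    while stack:
        cr, cc = stack.pop()
        for nr, nc in neighbours(cr, cc, size):
            if board[nr][nc] == ".":
                liberties.add((nr, nc))
            elif board[nr][nc] == colour and (nr, nc) not in group:
                group.add((nr, nc))
                stack.append((nr, nc))
    return group, liberties

def place_stone(board, r, c, colour):
    """Place a stone, remove opposing groups with no liberties, return stones captured."""
    board[r][c] = colour
    opponent = "W" if colour == "B" else "B"
    captured = 0
    for nr, nc in neighbours(r, c, len(board)):
        if board[nr][nc] == opponent:
            group, liberties = group_and_liberties(board, nr, nc)
            if not liberties:
                for gr, gc in group:
                    board[gr][gc] = "."
                captured += len(group)
    return captured

# The situation from the talk: a white stone with black on three sides.
board = parse([".....",
               ".....",
               ".BWB.",
               "..B..",
               "....."])
_, white_liberties = group_and_liberties(board, 2, 2)
prisoners = place_stone(board, 1, 2, "B")  # Black fills the last liberty
```

Because `group_and_liberties` follows connected stones of the same colour, the same code captures a large group in one move when its final liberty is filled.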

30:49

So this is called the ko rule. So let's imagine: here's another little zoomed in part of the board. I'll just replicate that board to the right so you can see how this is going to become a repeated board position. Let's imagine it's White's turn. They play here and capture that black stone, like I just showed you. Now it's Black's turn.

31:13

You might think, Well, they could just play back where that stone was captured and recapture the white stone that was just put down. So you might think, why can't black go there? And this is actually not allowed, because if Black was to go back, that would recapture that white stone. And you'd see now this new position we're in is the same as that original position. So, in fact, that recapture would not be allowed. Black would have to play somewhere else before recapturing.
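The repeated-position rule can be checked mechanically: play the candidate move on a copy of the board, resolve captures, and see whether the result matches a position that has already occurred. The sketch below is an illustration under assumptions, with a hypothetical board encoding and function names, a small 5 by 5 board, and no handling of self-capture.

```python
def parse(rows):
    """Turn a list of strings into a mutable board ('.' empty, 'B' black, 'W' white)."""
    return [list(row) for row in rows]

def key(board):
    """Hashable snapshot of a position, used to detect repeats."""
    return tuple("".join(row) for row in board)

def neighbours(r, c, size):
    candidates = ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1))
    return [(nr, nc) for nr, nc in candidates if 0 <= nr < size and 0 <= nc < size]

def group_has_liberty(board, r, c):
    """Return (group, True) as soon as a liberty is found, else (whole group, False)."""
    colour, size = board[r][c], len(board)
    group, stack = {(r, c)}, [(r, c)]
    while stack:
        cr, cc = stack.pop()
        for nr, nc in neighbours(cr, cc, size):
            if board[nr][nc] == ".":
                return group, True
            if board[nr][nc] == colour and (nr, nc) not in group:
                group.add((nr, nc))
                stack.append((nr, nc))
    return group, False

def play(board, r, c, colour):
    """Return a new board after the move, with captured opposing stones removed."""
    new_board = [row[:] for row in board]
    new_board[r][c] = colour
    opponent = "W" if colour == "B" else "B"
    for nr, nc in neighbours(r, c, len(new_board)):
        if new_board[nr][nc] == opponent:
            group, alive = group_has_liberty(new_board, nr, nc)
            if not alive:
                for gr, gc in group:
                    new_board[gr][gc] = "."
    return new_board

def would_repeat(history, board, r, c, colour):
    """True if the move would recreate a position that has already occurred."""
    return key(play(board, r, c, colour)) in history

# A ko shape: White captures at (1, 1); Black's immediate recapture would repeat.
position_a = parse([".BW..",
                    "B.BW.",
                    ".BW..",
                    ".....",
                    "....."])
position_b = play(position_a, 1, 1, "W")
history = {key(position_a), key(position_b)}
recapture_is_illegal = would_repeat(history, position_b, 1, 2, "B")
elsewhere_is_legal = not would_repeat(history, position_b, 3, 3, "B")
```

Keeping a set of all earlier positions matches the rule as stated in the talk (no repeated board position); comparing only against the position two moves back would be a narrower check.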

31:39

And that's it. That's how you play Go. So the objective of the game is not only to capture your opponent's stones, but also to wall off and surround empty territory, empty vertices. And you can see here, this is a picture of a Go board at the end of the game. And you just total up the number of the spaces that you've captured, and you add that to the number of stones that you've taken off the board. And the player with the highest total is the winner.
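The end-of-game count described above (surrounded empty vertices plus stones taken off the board) can be sketched as a flood fill over empty regions. This is a simplified illustration: the board encoding, the function names, and the prisoner counts in the example are hypothetical, and real rule sets add details (such as dead stones and compensation points) that the talk does not cover and this sketch ignores.

```python
def count_territory(board):
    """Count empty vertices in regions bordered by stones of only one colour."""
    size = len(board)
    seen = set()
    territory = {"B": 0, "W": 0}
    for r in range(size):
        for c in range(size):
            if board[r][c] != "." or (r, c) in seen:
                continue
            # Flood-fill one empty region, noting which colours border it.
            region_size, borders, stack = 0, set(), [(r, c)]
            seen.add((r, c))
            while stack:
                cr, cc = stack.pop()
                region_size += 1
                for nr, nc in ((cr + 1, cc), (cr - 1, cc), (cr, cc + 1), (cr, cc - 1)):
                    if not (0 <= nr < size and 0 <= nc < size):
                        continue
                    if board[nr][nc] == ".":
                        if (nr, nc) not in seen:
                            seen.add((nr, nc))
                            stack.append((nr, nc))
                    else:
                        borders.add(board[nr][nc])
            if len(borders) == 1:
                territory[borders.pop()] += region_size
    return territory

def final_score(board, prisoners):
    """Territory plus prisoners for each side, as described in the talk."""
    territory = count_territory(board)
    return {colour: territory[colour] + prisoners[colour] for colour in territory}

# Toy finished position: black wall in column 2, white wall in column 3.
finished = ["..BW.",
            "..BW.",
            "..BW.",
            "..BW.",
            "..BW."]
territory = count_territory(finished)
score = final_score(finished, prisoners={"B": 1, "W": 3})
```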

32:08

So here's the white territory and the black territory. And in fact, this is a very close game and White wins this game by one point. So that's how you play Go. So why is it hard for computers to play? Well, as I was just explaining, the complexity makes brute force exhaustive search intractable. And there are two main challenges. The branching factor is huge, and writing an evaluation function to determine who is winning is thought to be impossible.

32:34

So an evaluation function is a function to tell you whether the black or the white side is winning. And for Go, this is very difficult. Let me just unpack that for you by comparing it to the next most complex game, chess. So in chess, in an average position, there are 20 possible moves. And that's referred to as the branching factor. In Go, by contrast, in an average position,

33:00

There are around 200 moves. So the branching factor in Go is an order of magnitude bigger than it is for chess. The second issue, which is related to the evaluation function, is that Go is really primarily a game about intuition rather than brute calculation. If you ask a great Go player why they played a certain move, often they'll just tell you it felt right, and they'll use those words. Whereas if you ask a great chess player that, they'll never say that.

33:28

They'll tell you exactly the reasons, how they calculated that that move was the right move to play. And of course, what we know about computers is that, when we start using words like intuition, computers are traditionally not good at what we think of as intuition. But of course, they're very good at things that we think of as calculation. So one of the challenges of making computers good at Go is to replicate this kind of intuition that humans use to play.
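
To put rough numbers on the branching-factor comparison made a little earlier (about 20 moves per position in chess, about 200 in Go), here is a small arithmetic sketch. It assumes, purely for illustration, that the branching factor stays constant at every ply, which real games do not satisfy.

```python
CHESS_BRANCHING = 20   # figure quoted in the talk for chess
GO_BRANCHING = 200     # figure quoted in the talk for Go

def positions_after(branching_factor, plies):
    """Number of move sequences of the given length, assuming a constant branching factor."""
    return branching_factor ** plies

for plies in (2, 4, 6):
    chess = positions_after(CHESS_BRANCHING, plies)
    go = positions_after(GO_BRANCHING, plies)
    print(f"{plies} plies: chess ~{chess:.1e}, Go ~{go:.1e}, ratio {go // chess:,}")
```

Under this assumption the gap itself grows exponentially: every extra ply of lookahead multiplies the ratio between the two trees by another factor of ten.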

33:58

So this is the issue of writing an evaluation function and why it was thought to be impossible. For Deep Blue or any chess program, what you can do is write a set of handcrafted, pre-programmed heuristics or rules. In fact, as a first approximation for chess, if you just count up the value of the pieces on each side, that gives you a very rough-and-ready but reasonable estimate of which side, Black or White, is winning.
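
As a rough illustration of that first approximation, a material count might look like the sketch below. The piece values are the conventional textbook ones (pawn 1, knight and bishop 3, rook 5, queen 9), which are an assumption here and not something stated in the talk; the function name and the letter encoding are likewise hypothetical.

```python
# Conventional piece values (an assumption; the talk does not give numbers).
PIECE_VALUES = {"P": 1, "N": 3, "B": 3, "R": 5, "Q": 9, "K": 0}

def material_balance(pieces):
    """Sum piece values: uppercase letters are White, lowercase are Black.

    A positive result suggests White is ahead, a negative one Black.
    """
    score = 0
    for piece in pieces:
        value = PIECE_VALUES[piece.upper()]
        score += value if piece.isupper() else -value
    return score

# White has an extra rook; Black has an extra pawn.
print(material_balance("KQRRBNPPPP" + "kqrbnppppp"))  # 4
```

A count like this only works because different chess pieces are worth different amounts.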

34:22

That's, of course, impossible for Go because each of the pieces is worth the same. They're just stones, so there isn't any idea of materiality. So Go, which is why we've taken this on as a challenge, combines intuitive pattern recognition with logical planning and search. So I'm just going to take you through the technicalities of how we did this. What we did is we trained two deep neural networks to deal with some of these intuitive parts of Go.

34:52

So the first thing we did is we downloaded 100,000 games played by relatively expert humans, still amateurs playing on Internet Go servers, but pretty strong club players. We took those hundred thousand games and we trained our first neural network, which we called a policy network.

35:12

And this was done through supervised learning. What we did is we got this network to try to mimic the moves: to copy and predict what move a human amateur expert would play in a particular position. So this network was trying to copy those expert players. That was the first step. Once we had the first version of that, we then allowed it to play against itself many millions of times and improve its prediction capability through the use of reinforcement learning.

35:45

So it learned through trial and error, from its own mistakes, and that would then modify the neural network to make it incrementally better over time. Once we finished this self-play process, the new policy network could beat the original policy network 80% of the time. Now we freeze this final reinforcement learning policy network, and we allow that to play a final 30 million games on our Google Cloud service, and that generates our new dataset.

36:17

And we take one position from each of those 30 million games, so we have 30 million positions. So now we finally have a dataset that maybe is big enough to try to learn an evaluation function. We have these 30 million positions and we have the end result of each game, so we can try to learn a sort of correlation between that position and who ends up winning. So we then train this final network, which we call the value network.
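
The dataset construction described here, one position per self-play game paired with the final result, can be sketched as follows. The data layout (a list of positions plus a winner string), the uniform random choice of position, and the 1.0/0.0 outcome encoding are all assumptions made for illustration; the talk does not specify them.

```python
import random

def build_value_dataset(games, rng):
    """Pick one position from each finished game and label it with the game's outcome.

    games: list of (positions, winner) pairs, where winner is "black" or "white".
    Returns a list of (position, outcome) pairs with outcome 1.0 for a Black win
    and 0.0 for a White win (a hypothetical encoding).
    """
    dataset = []
    for positions, winner in games:
        position = rng.choice(positions)   # a single sample per game
        outcome = 1.0 if winner == "black" else 0.0
        dataset.append((position, outcome))
    return dataset

# Toy stand-ins for self-play games; real positions would be full board states.
games = [(["p1", "p2", "p3"], "black"), (["q1", "q2"], "white")]
dataset = build_value_dataset(games, random.Random(0))
print(len(dataset))  # 2
```

Taking only one position per game presumably keeps the training examples from being strongly correlated with one another, since consecutive positions in the same game differ by a single stone and share the same outcome.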

36:49

And this value network learns to predict who is winning the game from a particular position, and to estimate by how much. So this is really the core of the breakthrough with AlphaGo: the value network is this fabled evaluation function. But we didn't write it out by hand, as with something like Deep Blue, where expert Go players or chess players would write out by hand all the rules that could evaluate a position, a big database of rules.

37:18

We instead have a neural network that learns for itself directly from the data. So we take these neural networks forward, and we have two of them. The policy network, this network in green that I was showing you earlier, takes the board position, in blue here, as the input, and the output is a probability distribution over the possible moves. You can see here that the height of the green bars is the probability mass assigned to that particular move by the policy network.

37:52

So what this means is that our AlphaGo system doesn't have to consider all these 200 possible moves every time it's looking at a decision point. It can maybe just look at the top three or four most sensible or most likely moves. The second network, the network in pink, which is the value network, also takes the board position as an input. But this time the output is a single number, a real number between zero and one, where zero means White is winning by a huge margin.

38:23

One means Black is winning, and 0.5 means the game is even. So it estimates who is winning the game and by how much. We now take this forward and combine it with search, and I'm going to show you that in a second. But I just want to illustrate pictorially why using these two neural networks helps make playing Go tractable.
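
To make the two outputs concrete, here is a small sketch of how they might be consumed. It assumes, for illustration only, that the policy network produces raw scores that are turned into probabilities with a softmax; the move names and score values are invented, and the function names are hypothetical.

```python
import math

def softmax(scores):
    """Convert raw per-move scores into a probability distribution."""
    highest = max(scores.values())
    exps = {move: math.exp(s - highest) for move, s in scores.items()}
    total = sum(exps.values())
    return {move: e / total for move, e in exps.items()}

def top_moves(probabilities, k):
    """Keep only the k most probable moves, most probable first."""
    return sorted(probabilities, key=probabilities.get, reverse=True)[:k]

def describe_value(value):
    """Interpret a value-network output: 0 = White winning, 1 = Black winning."""
    if value > 0.5:
        return "Black is ahead"
    if value < 0.5:
        return "White is ahead"
    return "even"

# Invented scores for four candidate moves.
scores = {"D4": 2.0, "Q16": 1.5, "K10": 0.1, "A1": -3.0}
probabilities = softmax(scores)
print(top_moves(probabilities, 2))   # ['D4', 'Q16']
print(describe_value(0.5))           # even
```

Restricting attention to the few highest-probability moves is what lets the search ignore most of the roughly 200 legal moves at each decision point.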

38:49

So imagine here we're searching through the game of Go, and each of these little nodes, represented by these mini boards, is a position in a particular game we're playing. Now, the tree of possibilities branches out almost to infinity: this huge number of possibilities is completely intractable to search. So what we do is we firstly take the policy network, this network in green.

39:15

And what that does is reduce the breadth of the search, so we can home in and only look at the moves that are plausible and sensible. And then the value network you can think of as reducing the depth of the search.

39:32

So instead of having to search through the entire game tree until the end of the game to tell which side is winning, we can call the value network at any time and estimate which side is winning, so we can truncate that search at any depth level that we want. So you can see that, by using these two networks in tandem, we've cut down that enormous search space to something much more tractable.

40:01

So I'm going to show you how we do our search now. We use Monte Carlo tree search, and we also use another thing called rollout policies, and we combine those with the two neural networks I've just shown you. So let's imagine that we are making a decision. AlphaGo is in the middle of thinking about what move it should make next, and it's done a bit of searching from the current position, which is that node at the top of the tree.

40:26

That's our current position, and it's found a couple of promising moves. Now, the value of each move is represented by the letter Q here, the action value of each move, and what we're trying to do is find the move, in essence, that has the maximum Q. What we might do is follow a trajectory that has quite a high Q value, and you can see this in the bold black arrows. We follow that trajectory until we hit a node that has not been explored yet in the game tree.

40:56

So here, on the left-hand side of this tree that we're unfolding, once we hit that new node, the first thing we do is call our policy network, the green network, and ask it to expand the tree at that point, but only expand it with a few moves that it thinks are most probable: those with the highest P, the prior probability of that move.

41:21

So once that's expanded, we call the second neural network, the value net, to evaluate that position and give an estimate of who is winning. We also do a second thing: if we have time, we do rollouts to the end of the game, maybe a few thousand of them, to collect true statistics about who ends up winning the game from that position. And then we combine these two estimates, the estimate from the value network and the estimate from the rollout policy.
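
The search loop described in this part of the talk (follow high-Q moves, expand with the policy network, evaluate by mixing the value network with rollouts, back the result up the tree, then pick the move with the best Q) can be sketched as below. This is a simplified illustration, not AlphaGo's implementation: the exploration bonus, the equal mixing weight, the single rollout per evaluation, and every function and parameter name are assumptions, and all values are taken from one player's perspective, whereas a real two-player search would alternate perspective at each ply.

```python
import math

class Node:
    def __init__(self, prior):
        self.prior = prior     # P: prior probability from the policy network
        self.visits = 0        # number of simulations that passed through this node
        self.total = 0.0       # sum of evaluations backed up through this node
        self.children = {}     # move -> Node

    def q(self):
        """Q: mean evaluation of this node (0 if never visited)."""
        return self.total / self.visits if self.visits else 0.0

def select_child(node, c_explore=1.0):
    """Prefer high Q, plus an assumed bonus for high-prior, rarely visited moves."""
    def score(item):
        _, child = item
        bonus = c_explore * child.prior * math.sqrt(node.visits + 1) / (1 + child.visits)
        return child.q() + bonus
    return max(node.children.items(), key=score)

def search(root_state, policy_net, value_net, rollout, apply_move,
           simulations=200, top_k=3, mix=0.5):
    root = Node(prior=1.0)
    for _ in range(simulations):
        node, state, path = root, root_state, [root]
        # 1. Selection: follow promising moves down to an unexpanded node.
        while node.children:
            move, node = select_child(node)
            state = apply_move(state, move)
            path.append(node)
        # 2. Expansion: add only the few moves the policy network rates most probable.
        priors = policy_net(state)
        for move in sorted(priors, key=priors.get, reverse=True)[:top_k]:
            node.children[move] = Node(prior=priors[move])
        # 3. Evaluation: mix the value-network estimate with a rollout result.
        value = (1 - mix) * value_net(state) + mix * rollout(state)
        # 4. Backup: update statistics along the path that was followed.
        for visited in path:
            visited.visits += 1
            visited.total += value
    # Final decision: the root move with the highest Q.
    return max(root.children, key=lambda m: root.children[m].q())

# Toy stand-ins: a state is the string of moves played so far.
def toy_policy(state):
    return {"a": 0.5, "b": 0.3, "c": 0.2}

def toy_value(state):
    return {"a": 0.2, "b": 0.9, "c": 0.4}.get(state[:1], 0.5)

best = search("", toy_policy, toy_value, toy_value, lambda s, m: s + m)
print(best)  # b
```

In the toy example the policy prior favours move "a", but the evaluations favour "b", so the search ends up choosing "b": the prior only guides which branches are looked at first, while Q decides the final move.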

41:55

We combine them together to give a final evaluation of the promisingness of that branch of the tree. Once we get this new Q value, we then back that up the tree and update the connections and the decision points. And then finally, once we run out of time for searching and thinking and we have to make a decision, we, in essence, pick the move that has the most promising

42:24

Q value associated with it. So once we built AlphaGo, how did we evaluate it? Well, the first thing we did, back in April last year, was play against the other strongest Go programs available at that time. So we tried it against two, Crazy Stone and Zen, which are the strongest programs out there other than AlphaGo. Now I'll just explain the scale that we're going to show here on these bar charts.

42:53

On the right-hand side are dan and kyu levels, which are the ratings that you get when you play Go. When you are a beginner, you go in kyu, from about 25 kyu down to 1 kyu. Then, as an amateur, you go from 1 dan up to about 6 or 7 dan, and then you can become a professional if you pass certain qualifications, and you start again from 1 dan up to 9 dan at the professional level.

43:17

So that's what the three bandings are: yellow for beginner, orange for amateur, red for professional. On the left-hand side are numerical equivalents of those dan ratings. We call them Elo ratings, and this is our rating scale from zero to about 3500. The way of thinking about this is that if you have an Elo rating difference between two players of 200 to 250 points, that translates to about an 80% win rate for the higher-rated player.

43:49

Right. So it's a kind of Bayesian comparison between the strengths of these different players. What we found is that AlphaGo, when we played against these other programs, could beat them more than 99% of the time, in fact nearly 100% of the time. And there was a huge margin between AlphaGo and the next best program, Crazy Stone, of around 1200 Elo points.
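
The link between a rating gap and a win rate can be checked with the standard Elo expected-score formula. That this exact formula underlies the scale on the slide is an assumption, but it agrees with the figures quoted: roughly 80% for a 200 to 250 point gap and nearly 100% for 1200 points.

```python
def expected_win_rate(elo_difference):
    """Standard Elo expected score for the higher-rated player.

    elo_difference: rating of the stronger player minus that of the weaker one.
    """
    return 1.0 / (1.0 + 10 ** (-elo_difference / 400))

for gap in (200, 250, 1200):
    print(f"{gap} Elo points: {expected_win_rate(gap):.1%}")
```

With this formula a 200-point gap gives about 76%, a 250-point gap about 81%, and a 1200-point gap about 99.9%.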

44:12

And some of you who've been following this may know that Facebook also have their own program that they're working on, called Dark Forest. But that's not even as strong as Zen or Crazy Stone. In fact, it lost to Zen in an online tournament last month. So it's estimated to be around the same level as them. So there's around a 1200 Elo point difference between AlphaGo and these other programs. So we needed a greater challenge, and we thought, well, we're ready to play a top human professional.

44:42

And so we contacted Fan Hui, who is the reigning three-times European champion. He's a 2-dan professional, and he started playing Go at the age of seven back in China, where he grew up. He turned professional in China at the age of 16, and China is one of the most competitive places to try to become a professional. So this was very exciting for us back in October, and we didn't really know how well we were going to do.

45:05

We knew that we were much stronger than other commercially available programs, and obviously it was a lot better than any of us on the team, but we didn't know how strong it would be against a human opponent. So this is what happened. I think after what they did, maybe a little like fight in like play slowly. So it's why became the second game. I fight with things on. I see. Maybe I'm right. It's why it was another game.

45:38

I fight all the time. It's. No, it's not nice, but I lose all my. He's a really great guy, actually. He's a really good sport. So AlphaGo won five-nil, which was very surprising to us. We were hoping to win at least one game, but five-nil was pretty amazing. And this story ends well, though, don't worry. He looks anguished here, but we actually then hired him as a consultant for our team, ready for the next world match. So in the end, he's on the side of the computers now.

46:17

But one interesting thing, actually, is that he's since played a few more games against AlphaGo informally, and he feels that it's actually improved his own play. Very recently he won the European Professional Championship again, and he won with a full score: he beat every single other professional in Europe. So he feels he's got stronger by training against AlphaGo, in fact, which is quite interesting.

46:41

So anyway, he's around here on this measure: he's around 2900 Elo, and AlphaGo at that time was around 3100. Again, this is covered in a Nature paper that came out a couple of weeks ago on the front cover, and it's caused a huge stir in the AI community. I encourage you to read that if you want to hear much more of the technical details, which are outlined in that paper. So I just want to take a minute to explain the critical difference here between AlphaGo and Deep Blue.

47:16

So this is a big achievement: beating a professional player at Go is a long-standing grand challenge of AI research, and many smart people have been working on this for over a decade. In fact, this happened about a decade earlier than many experts in the field, the top programmers of the other Go programs, for example, thought it was going to happen, even as recently as last year. But the key thing for us is not just the achievement, but how we did it.

47:49

So we've used general-purpose algorithms: deep learning, reinforcement learning, tree search. These are general-purpose algorithms, and we've put them together in a way that learns how to play Go. It's not a handcrafted set of rules and heuristics like Deep Blue or chess programs, and it's also a modular system that combines pattern recognition with planning algorithms. So that's another thing: deep learning is hugely popular right now, very fashionable.

48:18

And we think it's critical. Of course, we have a huge deep learning team, many amazing deep learners at DeepMind. But we don't think that's the whole story on its own. We think other things are going to be required, like reinforcement learning and memory and other advances, combined with deep learning, to reach full intelligence.

48:41

And because of the way we trained AlphaGo, many people, many professional players, have commented on how humanlike its playing style is and how it thinks. If you think about it, AlphaGo has been trained in a way like a human expert player: it starts off by studying professional games and learning from them, and then improves through practice, by playing games of Go. So for us, what's the next step?

49:09

Now, as Mike alluded to, it's actually only about a week and a half away: the next step is to take on the world's best player, Lee Sedol. He's from South Korea, and he's a legend there. He's sort of like the David Beckham of South Korea, believe it or not. And I describe him as the Roger Federer of Go, because he's been at the top of the game for a decade,

49:33

but he's still one of the top three players in the world. He's won 18 international titles, kind of like Grand Slams, over the last decade. And we're challenging him to a $1,000,000, five-game match in Seoul from March 8th to 15th, and you can follow that on a YouTube live stream. And, you know, he's taking this pretty seriously.

49:54

Obviously, there's the money on the line and his reputation. But when he was asked by the South Korean press how he felt about the game, he said, I'm not sure if I represent the whole of humanity, but I think I am. So it's good that he's confident that he's going to win the match. He's actually a lovely guy as well, and I'm really looking forward to going out there. It's crazy out there.

50:17

We did a press conference yesterday via video call, and there were over 300 journalists, including live TV cameras, for a video call. So it's pretty crazy, and we're very excited to see what it's going to be like when we go there. But Lee Sedol, on our Elo measures, is significantly better than Fan Hui; he's a couple of notches better. Fan Hui is kind of at a grandmaster level, but there's another level to reach, sort of the world elite.

50:45

So he's at least 600 Elo stronger. So we have to go some if we want to beat him from where AlphaGo was back in October. So my final slide on Go is about how we do this testing. Well, we have our own internal testing, where we have different versions of our program running 24/7, playing against each other. And we can make accurate estimates of how strong we think our program is from this continual live tournament that's going on in the cloud.

51:15

But every now and again, we have to calibrate those internal tests with external testing, so we need to test against these external benchmarks. In April, we tested against Zen and Crazy Stone, and we won over 99%. Then in October, our October version could beat our April version 100% of the time. And obviously, we were playing Fan Hui, who we also knew could beat these other top commercial programs 100% of the time if he were to play them.

51:47

And so we knew we were at least roughly matched. But in the end, as you saw, we won five-nil. So now we're coming up to March, and we're playing Lee Sedol. On the Elo ratings, you would expect Lee Sedol to win around 97% of the time against Fan Hui, so it's a huge step up. Our number, which we've got on the left-hand side, is obviously confidential until the match. And obviously the million-dollar question is what's going to happen when we play him.
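
The 97% figure can be related back to the rating gap of at least 600 Elo mentioned a little earlier by inverting the standard Elo expected-score formula. As before, treating the slide's scale as standard Elo is an assumption, and the function name is hypothetical.

```python
import math

def elo_gap_for_win_rate(win_rate):
    """Rating difference implied by a given win rate under the standard Elo formula."""
    return 400 * math.log10(win_rate / (1 - win_rate))

print(round(elo_gap_for_win_rate(0.97)))  # 604
```

A 97% expected win rate corresponds to a gap of about 604 points, which is consistent with the "at least 600 Elo stronger" estimate.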

52:16

So it's going to be very exciting to see. I just want to give a big shout-out to the amazing team that's worked on AlphaGo, led by David Silver and Aja Huang as the team leads. Some incredible work has gone on on this. Now, of course, playing games is great fun, and it's very efficient for advancing our AI research.

52:38

But we also want to apply these technologies to the real world, and we plan to make some announcements about this over the next year in health care, in robotics, and in smart assistants. In all these different areas, we feel that extensions and components of what we're building for things like AlphaGo can be used very powerfully.

53:00

So I just want to end the talk with a couple of high-level thoughts on why I've been so obsessed with AI for my entire career and why I think it's so important. I see two big challenges facing society today. The first is information overload, which has deluged us as users and scientists with data everywhere: big data from genomics, entertainment, every sphere of human life.

53:25

Now, personalisation might be one technology to try to combat that, but unfortunately it doesn't work very well, because today it is mostly based on the averaging of crowds rather than actually adapting to you as a person, as an individual. Then secondly, the systems that we would like to master are so complex today, from climate to disease to energy, macroeconomics, high-energy physics. So, you know, you have to think that maybe the complexity of these systems is so great.

53:52

It's difficult to imagine how even an Einstein, someone at that level, can master these systems in their own lifetime and still leave enough time for innovation. So we think at DeepMind that solving AI in a fundamental way, like we're trying to do, is potentially a kind of meta-solution to all these other problems. If we can solve AI in this way, we can bring it to bear on all the other issues that we would like to solve.

54:15

So the dream, for me anyway, is to use this kind of AI to create AI scientists or AI-assisted science. And finally, I should mention a word about ethics. All powerful new technologies have to be used ethically and responsibly, and AI is no different. And even though human-level general AI is decades away, we should start the debate now.

54:37

And as a neuroscientist, I think trying to distil intelligence into an algorithmic construct and then comparing it to the human mind, which is actually the journey we're on, will be one of the best ways to better understand the mysteries of our own minds, and shed light on things like dreaming, creativity, and perhaps even the ultimate question of consciousness. Thanks for listening.

Transcript source: Provided by creator in RSS feed
