Strachey Lecture: Probabilistic machine learning: foundations and frontiers

00:01

George. Okay. So I'm Phil Blunsom, and I'm going to introduce our speaker for the Strachey Lecture this term. Firstly, I'd like to say a big thank you to Oxford Asset Management, our sponsor, who makes this series of lectures possible. I've also been asked to draw your attention to our hashtag that's prominently placed in case you would like to tweet. And also anyone interested in our software engineering programme can find brochures outside and people to talk to.

00:41

So, okay. But on to the main business. It's my pleasure to introduce Zoubin Ghahramani, who will give our Hilary term Strachey Lecture. Zoubin is Professor of Information Engineering at Cambridge. He's a Fellow of the Royal Society. I think it's fair to say that his machine learning group in Cambridge would be one of the most influential over the last decade.

01:06

It's hard to go to any major machine learning academic group or industry lab without finding Zoubin's ex-students or postdocs. More recently, Zoubin founded Geometric Intelligence, a start-up. And after its acquisition, he's now the co-director of Uber AI Labs. And maybe if you ask nicely, he might tell you a bit about that. We'll see. Okay. So Zoubin has made contributions across machine learning, particularly probabilistic inference, even deep learning.

01:41

It's not surprising. He was Mike Jordan's student and did his postdoctoral work with Geoff Hinton. So a pretty amazing pedigree there. Zoubin's seminal work is really in Bayesian nonparametrics, where he's been leading this idea that it's not enough now for machine learning research to aim for accurate predictions. We also need to be able to quantify uncertainty. We need to be able to talk about causation.

02:09

And if we really want machine learning and AI to have an impact in industry, we need to be able to tackle those things. I'm sure he will tell you all about these things. So please welcome Zoubin for the talk.

Thanks, Phil, for that great introduction, and a great thank you to the Department of Computer Science for inviting me. Okay. Can you all hear me? Yeah. Good. So I'm going to talk about probabilistic machine learning, which is my passion; it's the thing I'm really excited about.

02:45

And I'll start from basics, and as the talk goes on, we'll get into more and more current research, more of what we're actually doing these days. So that's why the subtitle is Foundations and Frontiers. Foundations is meant to be, you know, motivation, background material. But if you're bored by that, don't worry, it'll get more technical later on. Okay, so let's start from the basics. Machine learning. Well, what is machine learning? It's just a term. There are many other related terms.

03:24

You know, depending on the community that you come from, you might think about data mining or artificial intelligence or statistical modelling, neural networks, or pattern recognition, which is a bit of a more old-fashioned term. All these terms are related. I'll focus on the term machine learning, but keep the context in mind.

03:46

In terms of academic disciplines, this is also a very interdisciplinary area, in that we draw from ideas in computer science, engineering, statistics, applied mathematics, and we get a lot of inspiration from cognitive science, economics, even tools from physics and neuroscience. And then why are people interested in machine learning these days? Well, it used to be kind of an interesting academic field where you sort of played around and you kind of tried to get computers to learn from data.

04:19

Most people didn't care much about it, but now suddenly lots of people care. And the reason lots of people care is because there are many, many applications of machine learning. I like to think of it as the invisible thing that's behind a lot of the more visible applications that involve computers learning from data. So let's just go through some of those applications just to motivate.

04:48

Speech and language technologies is an area that has been transformed by the use of machine learning. So automatic speech recognition, machine translation, question answering, dialogue systems. And every year we seem to get more and more advances in these sorts of tools. Computer vision, again, a field that has been around for a very long time.

05:12

But with the advent of large amounts of data and more powerful computational tools, we're able to now do interesting things like not just object, face and handwriting recognition, but image captioning: going from an image to a bit of text that's meant to describe the image. And this is from a very famous paper. And, you know, you can actually pick it apart in the sense that you could say, well, these are hand-chosen to make the algorithm look good.

05:46

You know, man in black shirt is playing guitar. That seems pretty amazing that a computer could take an image like this and produce this description of the image. It doesn't always work that brilliantly, but I would say that most of us in the field were stunned when we saw this happen for the first time that we could actually get a system that would produce some reasonable descriptions from images. Of course, we all have cameras in our pockets that put boxes around people's faces.

06:19

If you ever ask yourself, well, how does that work? Well, that's a bit of machine learning that runs on all of your camera devices. Moving into the sciences, a lot of the sciences have become very data heavy. Fields like bioinformatics and genomics in the medical sciences, but also astronomy: areas where we're now able to collect much more data than any human being could sit down and analyse manually.

06:49

And so machine learning and AI tools have been very important in scientific data analysis, and that's something I'll talk about maybe a little bit later on as well. Recommender systems we all know. These are, you know, "customers who bought this item also bought" this kind of thing; that's driven by machine learning. Self-driving cars, something that I'm now much more involved in. This is not a totally new thing.

07:18

I mean, this self-driving car, ALVINN, was around about 30 years ago, and it used neural networks to drive around at 70 miles per hour on highways. That's what it says on this slide that I took from about 30 years ago. That's very scary. I would not want to be anywhere close to that truck driving at 70 miles an hour on a highway driven by a neural network that's about this big. But things have moved on and we now have pretty good self-driving systems that are just getting better every year.

07:54

Robotics. I just love dogs playing football. So this particular sort of RoboCup isn't necessarily driven by machine learning, but there are a lot of excellent uses of machine learning in robotics. Automated trading, financial prediction. Computer games: you're all familiar with, you know, the DeepMind landmark results, first with learning Atari games, playing Atari games at a human or superhuman level, then more recently beating the world master at Go.

08:36

And who knows what this is? This is Libratus, a system that recently won a poker championship. And this was against a whole bunch of humans. All the numbers in parentheses are how much money the humans lost to the computer. And the very interesting thing about this is that this is quite a complicated game, in that if you think about poker, what does it involve?

09:08

Well, it involves things like trying to understand the state of mind of the other player, and bluffing, and things like that. So to be a good poker player, you have to be able to do those things. And so now we have good machine poker players as well. So what is it? Well, machine learning, if I had to define it, I would use a sentence like this. It's an interdisciplinary field that develops both the mathematical foundations and practical applications of systems that learn from data.

09:39

Here are some of the main conferences and so on associated with that field. So that's all in terms of motivation from applications. But actually when you look at machine learning systems, most of the time they are trying to solve one of a few canonical problems. So I'll just go through those canonical problems in my kind of introductory part of the lecture. So this is probably the most canonical problem, the classification problem.

10:08

You have some data, you want to classify it into two or more classes. So the task is to predict some discrete class labels from input data. That has lots and lots of applications. And there are a lot of buzzwords for different methods that can be used for classification. These are just different ways of trying to do classification from data. Regression: trying to predict some continuous quantity Y from some inputs X. Obviously, this has lots of applications as well.

10:41

And, you know, there are lots of methods, some of which you might say, well, that's not a machine learning method, that's linear regression. It's been around for over a hundred years. But, you know, again, remember, this is all in the context of everything that's been going on in all of these neighbouring fields. And there's nothing that says, oh, this is a machine learning method and that's not a machine learning method.

11:03

If it's making predictions and decisions from data, it is a machine learning method at some level. Clustering: the task here is to group data together so that similar points are put in the same group. Many applications, again, many different methods. Dimensionality reduction: when you have very high-dimensional data, you might want to find a low-dimensional representation of that data that preserves important information.

11:31

Another canonical machine learning problem: semi-supervised learning, where you might have some labelled data. Here you might have a few labelled points, like these two labelled points that are minuses and these three that are pluses. And you might want to basically be able to leverage the fact that you have a lot of unlabelled data as well.

11:55

And so semi-supervised learning combines labelled and unlabelled data to get better predictions. And reinforcement learning, which is related to sequential decision making and adaptive control. The task there is to learn to interact with an environment, making sequential decisions to maximise future rewards. So it's an interactive setting where you have an agent producing some actions or decisions in an environment.

12:22

There may be some hidden state to both the agent and the environment, and then you get some observed sensory inputs, and the agent has to act in the environment to maximise its rewards. Okay. So these are the canonical problems. It is actually quite bewildering if you start reading the machine learning literature and you're not an expert, because there are many, many different methods and, you know, every paper seems to present a new method.

12:53

And so here is a very crude way of organising a bunch of machine learning methods. But don't put too much weight on this. Okay. For the first few minutes, I'm going to focus on one bubble here, which is this neural networks and deep learning one. And the reason I'm focusing on that should be pretty obvious for any of you who are familiar with the field, because these methods have been really revolutionary.

13:22

They've really been involved in some of the most spectacular breakthroughs in the last few years. So what are they? Well, a neural network, and I'm going to focus here on a feedforward neural network just for simplicity; there are other kinds. A feedforward neural network, the most standard one, is essentially just a function approximator. So it takes some inputs, call them X, and it produces some outputs.

13:57

Call them Y. And the way it produces them is through a sequence of transformations organised in layers. But all of that is in a sense a bit of a detail. It's just a way of representing a function that maps from X to Y via tuneable parameters called weights. I'm using theta to denote the parameters of the network.

14:27

So one of the important aspects of neural nets is that they're nonlinear functions, and they're often nonlinear both in the input and in the parameters. So optimising them to minimise some objective function tends to be slightly complicated. The other defining characteristic of neural networks is that they represent the function from X to Y in layers, which is essentially just a composition of functions.

15:01

Okay. So here is a multilayer neural network with one hidden layer, represented as a function that maps from Xs to Y through some parameters. And these superscripts here, one and two, index the two layers of parameters that you have. These neural networks are usually trained to maximise some likelihood, so they fall very squarely within the world of statistical models, using some variant of stochastic gradient descent optimisation. So this is where we start using tools from optimisation theory.
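To make the "composition of functions" view concrete, here is a minimal sketch of a network with one hidden layer, using only the Python standard library. The tanh nonlinearity, the layer sizes and the weight values are illustrative assumptions and are not taken from the slide.

```python
import math

def tanh_layer(W, b, x):
    """Hidden layer: affine map followed by an elementwise tanh."""
    return [math.tanh(sum(w * xj for w, xj in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def linear_layer(W, b, h):
    """Output layer: affine map with no nonlinearity."""
    return [sum(w * hj for w, hj in zip(row, h)) + bi
            for row, bi in zip(W, b)]

def network(theta, x):
    """y = layer2(layer1(x)): a composition of two layers with parameters theta."""
    W1, b1, W2, b2 = theta
    return linear_layer(W2, b2, tanh_layer(W1, b1, x))

# Hypothetical parameters: 2 inputs -> 3 hidden units -> 1 output.
theta = (
    [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]],  # W1: first-layer weights
    [0.0, 0.1, -0.1],                         # b1: first-layer biases
    [[1.0, -1.0, 0.5]],                       # W2: second-layer weights
    [0.2],                                    # b2: second-layer bias
)

y = network(theta, [1.0, 2.0])
print(y)
```

The output is nonlinear in both the input and the weights, which is why fitting theta is usually done with gradient-based optimisation and has no closed-form solution.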

15:35

Okay. So that's one slide on neural networks. And these things have been around for many decades. In fact, these things are what got me excited about AI back in the eighties when I was sort of an undergraduate and thinking about what to do with my life. But what's happened is that something dramatic has happened between the 1980s and now. And one of the things that's dramatic is that the terminology has changed. So people now call these deep learning systems because they have many more layers.

16:13

But there are other, more interesting, dramatic things that have happened.

16:16

So these deep learning systems that are involved in a lot of these very impressive benchmarks are very similar to the neural net architectures from the eighties and nineties, with some important architectural and algorithmic innovations, like being able to use many layers, particular nonlinearities such as the ReLU, particular ways of regularising them like dropout, and very useful tricks for dealing with time series like the [inaudible].

16:49

They're also trained using vastly larger data sets, really web-scale data sets. To do that, you need vastly larger compute resources, so GPUs, GPUs on clouds, etc. Importantly, there's been a major effort to democratise the software tools so that it's quite easy to actually train a neural network. So we have much better software tools, things like Torch and TensorFlow. And of course, there has been vastly increased industry investment and media hype.

17:25

And what that has meant is that there is a huge influx of people trying out different variations of neural networks on different problems. And stepping back, I kind of think of this as the community of machine learning researchers running a bit of a genetic algorithm, trying out lots of different ideas and variations on ideas to be able to improve on the performance of existing benchmarks.

17:57

Okay. So that's deep learning in a nutshell. There's huge amounts more to say about that, and there are many better people than me to talk about that. But one thing I do want to talk about is limitations of deep learning. So let's step back from the excitement. Let's acknowledge the excitement and let's say, well, where do we go next? What do we need to focus on? And I would argue that there are a few limitations we really need to think about.

18:28

So one of them is that neural nets are very data hungry. You often need millions of examples to train these large models. And that should not be surprising if you know a bit of statistics. Perhaps the surprising thing is that you don't need that many millions to train models with millions of parameters.

18:48

People would have thought that was crazy. And it is surprising that you can get away with, you know, relatively small amounts of data, even though it's large by the standards of the eighties and nineties. They're also very compute intensive to train and deploy. They're poor at representing uncertainty. And this is something that I'm particularly interested in.

19:12

There are some great studies that show that neural nets and deep learning systems can be easily fooled by adversarial examples. So you can construct examples that will make the neural network very confidently give the wrong answer, and that should be worrying.

19:30

That relates to the uncertainty thing. It's okay for a system to make mistakes, but it's not okay for it to be really confidently making mistakes, because then you don't know when to trust the answers, and you can't really build mission-critical systems, things like, let's say, in the health care domain or in self-driving cars and so on, if you really can't trust the confidences of your model. They're finicky to optimise.

19:59

You know, optimisation is non-convex and there are many different parametric and architectural choices that need to be made. And they're generally uninterpretable black boxes, lacking in transparency and difficult to trust. Okay, of course people are working on all of these things, but I wanted to put them on a slide to sort of motivate us to move towards the interesting challenges that we have.

20:24

A particular area that I'm really interested in, which Phil mentioned in the introduction, is thinking about machine learning as probabilistic modelling. So let's go beyond deep learning. I'll come back to neural nets and deep learning in a minute in the context of probabilistic modelling. Let's go beyond deep learning and let's talk about a particular view of machine learning that's grounded in the idea that we want systems that will build models from data,

20:57

probabilistic models from data. So what do I mean by a model? The term model gets used by many people in different contexts. What I mean is a model describes data that one could observe from a system. Okay. So models should be able to make predictions. A model should make statements about observable data. If it doesn't do that, then it's very difficult to know if you have a good model or not, whether you have a falsifiable model, for example, or not.

21:31

Now, if a model is making statements about possible data that could be observed, then what we're going to do is we're going to use the mathematics of probability theory to express all forms of uncertainty and noise associated with our model. So think about a simple model. Let's say a model that does forecasting of the weather tomorrow. Okay. That's not necessarily a simple model, but one could certainly build a simple version of that.

22:02

Okay. Now, you don't want models that make forecasts that don't tell you how uncertain they are. And now you have to consider where are all the different sources of uncertainty that you could have in predicting the weather tomorrow? You might have uncertainty that's coming from the noise in the sensor data that you collected. You might have uncertainty that's coming from the fact that there are unpredictable effects that your model did not consider.

22:33

Your model might have parameters and you might be uncertain about what the right parameters are. All of those sources of uncertainty we need to deal with somehow. And what we're going to do is we're going to use the language of probability theory to express uncertainty. And to me, that is as fundamental as saying that we use calculus as the language to express rates of change. Probability theory is the language of uncertainty. Then the good news is that we don't have to invoke anything else.

23:03

We can just stay within the framework of probability theory to infer aspects of the model from data, to adapt our model to data, to make predictions, etc. So it all ends up being very, very simple. And here is what it looks like. Here is Bayes rule, which is the sort of engine that drives learning from data. And I'm colour coding things into two classes data and hypotheses. And what I mean by data is anything that's actually measured a measured quantity.

23:47

And what I mean by hypotheses is everything else. Okay? The world, from a Bayesian point of view, is divided into two kinds of things: stuff you're measuring and stuff you're not measuring. Okay? And the stuff you're measuring, you've measured. So you kind of know what it is. It could be noisy, but you've measured it. And the stuff you're not measuring, you'd better represent the fact that you're uncertain about it, because you didn't measure it. Okay, so all of those things we call hypotheses.

24:19

Okay. So that's not the only thing there. If we think about these hypotheses as trying to express models of data, we're going to use probability theory to express our models. So basically, for every potential configuration of our hypotheses, we should be able to describe what is the probability of the observed data under that hypothesis.

24:49

That's the term that's called the likelihood, and that's actually what drives most neural network learning: maximising likelihood or penalised likelihood of some kind. But forget about neural nets. Now we're talking much more generally. We have this term, which is the likelihood, which gives you the probability of the data given the hypothesis. And then we have this term, which is called the prior. And the prior is our representation of our uncertainty about everything we haven't observed,

25:20

before we get our data. So the game goes like this. Before we have our data, we have to place our bets on all the unobserved things. We use the language of probability theory to do that. So we put a probability distribution over our space of hypotheses.

25:38

Then we observe the data. Aha! That's the beautiful moment where we can now compute the likelihood, the probability of the data given the hypotheses. And the simple rules of probability tell you: you multiply these two, you renormalise over all the hypotheses that you've been considering, and then what you get is your new state of knowledge, the posterior distribution of your hypotheses given the data. And that is the prior that you would use if you got any more data.
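The multiply-and-renormalise step described above can be sketched on a tiny discrete example. The three candidate coin biases and the observed flips are illustrative assumptions chosen only to keep the arithmetic easy to follow.

```python
def posterior(prior, likelihood):
    """Multiply prior by likelihood for each hypothesis, then renormalise."""
    unnormalised = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(unnormalised.values())  # P(data), summed over all hypotheses
    return {h: p / evidence for h, p in unnormalised.items()}

# Hypotheses: candidate values for a coin's probability of heads (illustrative).
hypotheses = [0.25, 0.5, 0.75]
prior = {h: 1.0 / len(hypotheses) for h in hypotheses}

def likelihood_of(flip):
    """P(flip | h) for every hypothesis h; flip is 'H' or 'T'."""
    return {h: (h if flip == "H" else 1.0 - h) for h in hypotheses}

# Each posterior becomes the prior for the next observation.
belief = prior
for flip in "HHT":
    belief = posterior(belief, likelihood_of(flip))

print(belief)
```

After two heads and one tail the belief is roughly 0.15, 0.40 and 0.45 over the three biases. Processing the flips one at a time gives the same answer as a single update on all three, which is the sense in which the posterior is just the next prior.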

26:05

So there's nothing really fundamentally different between the prior and the posterior. It's just the representation of your state of knowledge at any point in the process, with the data you've observed so far. Okay, so learning and prediction can be seen as forms of inference using this rule. And here is a one-slide description of Bayesian machine learning that I always use; apologies to people who've seen it.

26:37

But the point is that even Bayes rule, which I had on the previous slide, is not a fundamental rule. The fundamental rules of probability theory are these two simple rules: the sum rule and the product rule. The sum rule tells you that the probability of some unknown quantity X is the sum over some other unknown quantity Y of the joint probability. This is also sometimes called the marginalisation rule.

27:07

And the product rule says that the joint probability of X and Y can be factored into the probability of X times the probability of Y given X, or the other way around. So from these two simple rules, if we substitute X and Y with data and hypotheses, we can get Bayes rule, which we got in the previous slide. If we use the following symbols, theta to represent the parameters of our model, D to represent the observed data, and M to represent the model class that we've assumed,

27:42

then we get this expression here, which is just Bayes rule applied to parameters of our model. What would the parameters be? For example, in a neural net they would be the weights in the neural net; in linear regression, they would be the linear regression coefficients, etc. Every model has parameters in this world. Okay. And this is the prior, that's the likelihood. And this term here is the normalising constant, which is itself quite interesting.
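For readers without the slide, the two rules and the resulting expression can be written out as follows. The notation (lowercase x, y and m, integrals for continuous parameters) is an assumption; the slide's own typesetting is not recoverable from the transcript.

```latex
% Sum rule (marginalisation) and product rule
P(x) = \sum_{y} P(x, y)
\qquad
P(x, y) = P(x)\, P(y \mid x)

% Bayes rule applied to the parameters \theta of model class m, given data D
P(\theta \mid D, m) = \frac{P(D \mid \theta, m)\, P(\theta \mid m)}{P(D \mid m)}

% The normalising constant is obtained by applying the sum and product rules
P(D \mid m) = \int P(D \mid \theta, m)\, P(\theta \mid m)\, d\theta
```

Here P(D | theta, m) is the likelihood, P(theta | m) is the prior, P(theta | D, m) is the posterior, and P(D | m) is the normalising constant.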

28:11

It's called the marginal likelihood. Now, this follows from the sum and product rules. If you want to make predictions about any unknown quantity X given the data, then the sum and product rules tell you that there's only one valid way to make predictions under this framework, and that one valid way is you consider the predictions made by every possible parameter value.

28:38

So those are these terms, and then you weight them by this term in green, which is the posterior probability of the parameters given the data and the model class. So the act of forecasting or predicting any unknown quantity X given the observed data is, by the sum and product rules, an averaging process. You have to average over all the hypotheses that you've considered. You don't pick the best one or your favourite one, or flip a coin, or anything like that.
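As a small numerical illustration of prediction as averaging, the sketch below compares the posterior-weighted prediction with the shortcut of using only the single most probable hypothesis. The posterior weights over three coin biases are made-up numbers for illustration.

```python
# Posterior over three hypothetical coin biases (illustrative numbers).
posterior = {0.25: 0.15, 0.5: 0.40, 0.75: 0.45}

def predictive_heads(posterior):
    """P(next flip is heads | data): each hypothesis's prediction, weighted by its posterior."""
    return sum(theta * weight for theta, weight in posterior.items())

def plug_in_heads(posterior):
    """The shortcut that averaging rules out: predict with only the most probable theta."""
    return max(posterior, key=posterior.get)

averaged = predictive_heads(posterior)  # approximately 0.575
best_only = plug_in_heads(posterior)    # 0.75

print(averaged, best_only)
```

The averaged prediction is less extreme than the plug-in one because it still gives weight to the hypotheses that were not the single best.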

29:07

You're supposed to average over the space of hypotheses in this particular way. And if you now want to compare different model classes, then you might apply Bayes rule at the level of model classes. And that looks like this, where this term in red, the marginal likelihood, now appears in the numerator and denominator. None of this is actually mysterious. It all follows from these two rules. What do I mean by model comparison? The story might go like this.

29:40

Okay, let's say I'm a biologist. I do an experiment and I have a colleague, and my colleague says, I believe that, you know, this transcription factor regulates these genes. And I say, no, I have a different model. I believe that it doesn't and that this one does, or something like that. So my colleague and I have two different models now.

30:03

We could argue about it in words, but if we follow this probabilistic framework, what we should do is both of us should write down the model to the specification level that it could make predictions about observable data. It could assign a probability to the observable data, and then we observe the data D. And now we can settle the argument. We basically say, all right, what is the marginal likelihood that your model gave to my data?

30:32

What is the marginal likelihood my model gives to the data? Well, both of our models had some free parameters. Maybe your model had 17 free parameters, and my model had three free parameters. So my model is simpler somehow. And, I don't know, I get nervous. I say, that seems unfair. Okay, so your model had more parameters. If my colleague goes and optimises those 17 parameters, then sure enough she can fit the data much better than I can.

31:02

Right. But that's not the game. Optimisation doesn't follow from the sum rule and the product rule. It doesn't matter that my colleague has 17 parameters and I have three. If we can both compute the marginal likelihood, then we can settle this argument. Okay, so I actually really strongly believe that in an ideal world, science would be done like this. People wouldn't just publish their papers in open journals and share their data in an open manner.
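The point that marginal likelihoods settle the argument regardless of parameter counts can be illustrated with a far simpler pair of models than the 17-versus-3 story. In this sketch, which is an assumption made for illustration and not an example from the talk, model A says a coin is fair (no free parameters) and model B gives the coin an unknown bias with a uniform prior (one free parameter). Model comparison then follows P(m | D) ∝ P(D | m) P(m).

```python
import math

def log_beta(a, b):
    """Log of the Beta function B(a, b), computed via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def marginal_likelihood_fair(heads, tails):
    """Model A: fair coin. Nothing to integrate over."""
    return 0.5 ** (heads + tails)

def marginal_likelihood_biased(heads, tails):
    """Model B: bias theta with a uniform prior on [0, 1].
    The integral of theta^heads * (1 - theta)^tails equals B(heads + 1, tails + 1)."""
    return math.exp(log_beta(heads + 1, tails + 1))

def posterior_prob_biased(heads, tails, prior_biased=0.5):
    """P(model B | data), applying Bayes rule at the level of models."""
    a = marginal_likelihood_fair(heads, tails) * (1.0 - prior_biased)
    b = marginal_likelihood_biased(heads, tails) * prior_biased
    return b / (a + b)

balanced = posterior_prob_biased(5, 5)   # data that look fair
lopsided = posterior_prob_biased(9, 1)   # data that look biased

print(balanced, lopsided)
```

With five heads and five tails the simpler model is favoured, even though the flexible model could fit that data equally well at its best parameter setting. With nine heads and one tail the flexible model wins. The flexible model spreads its probability over many possible data sets, so it pays a price when the data do not need the flexibility.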

31:29

I think actually people should write down their models in a way that one could evaluate with future data, maybe write them as probabilistic programs, which I'll talk about later. And then we could really do objective, well, actually, it's subjective, but, you know, we could do a sort of principled comparison of models given different subjective opinions about what the hypotheses are. Okay, so that's one slide on Bayesian machine learning. So why should we care about all this?

32:02

We've had a revolution in machine learning with wonderful, fantastic deep learning methods that never talk about Bayes anywhere in them. So why should we care about all this Bayesian stuff? Well, the reason I care is that I'd really like models with calibrated senses of uncertainty.

32:23

So I want to be able to trust my system. If it says the probability of there being a pedestrian in front of my car is 0.1, I want that to mean 10%, and I can take actions that correspond to that calibrated probability. Getting systems that know when they don't know, I feel, is very important. Also, there's a very beautiful thing about all of this, which relates to that unease about, like, 17 parameters versus three parameters, or different structures of models.

32:57

Well, this framework actually gives you automatic tools to compare models of different complexity and to automate the learning of models from data. And this is called the Bayesian Occam's Razor. And it's something I will use in the latter part of my talk. Okay. So let's go back to our neural networks, just to ground the discussion a little bit. Here's a neural network. It maps from X to Y. There are different sources of uncertainty here.

33:29

One of them is parameter uncertainty. We have weights in the neural network. And, you know, given any finite amount of data, we're not sure what those weights should be. So we need to represent our uncertainty. But we also have structural uncertainty. We've made some structural choices like the architecture, the number of hidden units, our choice of activation functions. And that's also a source of uncertainty. So it would be great if we could represent all of that. And that's not a new idea.

34:00

None of these are really new ideas. In fact, the idea of doing Bayesian analysis of neural networks has been around since the early nineties at least, actually the late eighties. Here's a bit of a history of a few different methods. Here is a depiction of what we'd really like. So here's a system that was trained to do some regression on some data.

34:26

And what we'd really like is this sort of behaviour: that outside of the range of its training data, it should say, I don't really know. And there are many ways of doing that. These are all different ways of doing that. And we had a nice workshop at NIPS on Bayesian Deep Learning, where we kind of brought that history together and looked at some of the current state of the art.
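The "I don't really know" behaviour outside the training range can be seen in a much simpler model than a Bayesian neural network. The sketch below swaps in Bayesian linear regression with a Gaussian prior on an intercept and a slope, which is not the method on the slide; the data, the prior precision `alpha` and the noise precision `beta` are all illustrative assumptions.

```python
def fit(xs, ys, alpha=1.0, beta=25.0):
    """Posterior over (intercept, slope) for y = w0 + w1 * x + noise.
    alpha: prior precision on the weights; beta: noise precision."""
    # Posterior precision matrix A = alpha * I + beta * sum(phi phi^T), phi = (1, x).
    a11 = alpha + beta * len(xs)
    a12 = beta * sum(xs)
    a22 = alpha + beta * sum(x * x for x in xs)
    det = a11 * a22 - a12 * a12
    S = ((a22 / det, -a12 / det), (-a12 / det, a11 / det))  # covariance = A^-1
    # Posterior mean = beta * S * sum(phi * y).
    b1 = beta * sum(ys)
    b2 = beta * sum(x * y for x, y in zip(xs, ys))
    mean = (S[0][0] * b1 + S[0][1] * b2, S[1][0] * b1 + S[1][1] * b2)
    return mean, S

def predict(x, mean, S, beta=25.0):
    """Predictive mean and variance at input x (noise plus parameter uncertainty)."""
    pred_mean = mean[0] + mean[1] * x
    pred_var = 1.0 / beta + S[0][0] + 2.0 * S[0][1] * x + S[1][1] * x * x
    return pred_mean, pred_var

# Made-up training data, all with inputs between -1 and 1.
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [-0.9, -0.6, 0.1, 0.4, 1.1]

mean, S = fit(xs, ys)
_, var_inside = predict(0.0, mean, S)
_, var_outside = predict(5.0, mean, S)

print(var_inside, var_outside)
```

The predictive variance at x = 5, far from any training input, is larger than at x = 0, because uncertainty about the slope is multiplied by the distance from the data. A Bayesian neural network aims for the same qualitative behaviour with a far more flexible function class.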

34:51

So this world of machine learning often has camps, and people think that you have to be in one or another camp, but you don't. You actually have to understand what all the tools are in the different camps, and there's a lot of fertile ground at the intersection of these camps. And this is one example of those things. So when do we need probabilities? Well, we need them when our learning and intelligence problem depends crucially on representing uncertainty.

35:24

I've sort of said that. But let me describe some examples of that. So any time we're doing forecasting, and, you know, that could be financial forecasting, weather forecasting, forecasting demand at Uber or for Amazon products or whatever, we need to represent our uncertainty. Decision making: generally, when you make decisions, you're thinking about the consequences of your actions into the future. And it's really useful to represent uncertainty there.

35:56

It's hard to imagine not doing that at some level. When you're learning from limited, noisy and missing data: if you imagine dealing with, say, medical records, if you're trying to do machine learning on medical records, you have patients, and each of them has lots of things that are unobserved. Maybe there are a few medical tests that have been done on each patient.

36:20

Most of the data is actually missing, if you look at it that way. If you want to learn complex personalised models: again, whether it's in a medical domain or in a retail domain or something like that, you might think you have a huge data set, but actually for every patient or every customer, you only have a little bit of data, right? So it's not really a big data problem. You need to represent uncertainty about that individual.

36:50

The whole field of data compression is based on probabilistic modelling, and a lot of my interest in automatic model discovery and experiment design is really based on uncertainty. Now, over the last three months, I've been involved in setting up Uber's AI Labs. I'll just mention that in one slide. Why would Uber care about any of this? Well, look at many of the problems that a large technology company has to solve.

37:24

They are problems that deal with uncertainty, decision making, personalisation and so on. There are a huge number of problems, a huge number of opportunities, around any of the major technology companies for learning from data and for using uncertainty in there. And, you know, fairly obviously, if you're trying to build a very complicated system that makes decisions in the real world, like a self-driving car, you'd really like to have calibrated uncertainties in that system. Okay.

37:58

So here is the one-slide picture of my current passions, my current research interests. In the next few minutes, and I'll leave a few minutes for questions at the end, I'm going to touch on a few of these topics. And it's fairly modular, so I can stop to give us time for questions. But I want to put this slide up here because, well, actually, because I had this slide.

38:31

So that's one reason. And the reason I had this slide is that I was asked to give a talk about a year ago and they told me, summarise your work in one slide. So that forced me to produce this slide. And then when I produced it, I thought it was actually kind of a useful exercise. So the useful exercise is that it crystallised in my mind the thing that really drives me. And, you know, it's not that I'm a Bayesian and I just love probabilities or anything like that.

39:04

It turns out the thing that really drives me is that I like stuff that's automated. I want things to be systematic and automated. And computer scientists are very good at that. If you put your computer science hat on, you do something three times and you think, oh, I need to write a computer program to do that for me. Three times was two times too many. Right. And the sorry state of machine learning is that stuff is not really automated.

39:41

There is still a tremendous amount of human labour, arbitrary decision making and tweaking involved in deploying machine learning systems. Which is ironic. The whole field is about getting systems to learn from data. But then there are a lot of well-paid researchers and engineers tweaking those systems that learn from data. So let's think about automating these things. And this is what drives me. So if you look at some of these topics, which I'm going to talk about.

40:14

So the automatic statistician, what is that about? I'll talk about that in a couple of minutes. That's about automating the process of model discovery from data. So searching for a good model from data. Probabilistic programming, something that Frank Wood, who is at Oxford, is a world expert in: probabilistic programming is automating the process of doing inference from a very general probabilistic model. We also want to automate optimisation.

40:49

So optimisation is actually a sequential decision problem. If you have an optimiser that's trying to optimise a function, it's making decisions about where to evaluate the function next, collecting some data and then moving on to another point and so on. People don't think about optimisation that way. They just think about, here's an algorithm and here's something I can prove about the algorithm. But actually optimisation is very much like, you know.

41:14

Bandit problems, reinforcement learning problems: sequential decision making under uncertainty is something that drives this, because we want to optimise, sorry, we want to automate the allocation of computational resources. So especially now that machine learning systems are very complex. Right. These systems use a lot of memory, a lot of CPU. The datasets are very big. So we can't afford to just tinker about and run a few experiments on a single computer.

41:47

And when we run major experiments, we actually have to worry about the fact that this is running on a big, you know, cloud of computers. And, you know, that's using energy and energy costs money and it's not good for the world, right, using energy like that. So optimising resource allocation. So these are the things that drive me these days. I'm going to talk about a couple of them very quickly. Probabilistic programming is one of them.

42:17

The problem here is that developing probabilistic models and deriving inference algorithms is generally a very time-consuming and error-prone process. And the solution is to develop probabilistic programming languages. So what are these things? This is a very beautiful marriage between the probabilistic modelling and programming languages worlds. And the idea is that you have a probabilistic programming language, which is a way of expressing probabilistic models.

42:49

And the modern ones, the ones that people are very interested in, like Frank Wood and myself these days, are completely general programming languages, sort of Turing-complete programming languages that can express any computable probability distribution. That's the expression part. But what do you do with that? Well, first of all, how do you do that? You express your model as a simulator, a simulator that would generate data.

43:17

That's one canonical way of doing that. And that's a very natural concept. You could say, okay, I have a model for the weather. Well, that's actually kind of a simulator. Okay. And I write it as a computer program. I have a model for my gene expression network, and that's going to be a simulator that, you know, simulates gene expression data. Okay. That's the modelling part.

43:44

But then you have some data, you have a simulator and you have some data, and what you're really interested in is inferring or learning parameters of your simulator, of your model, given the data. And the very incredible thing is that we can actually come up with universal inference engines. We can come up with inference engines that in principle could compute the probability distribution over the hidden variables in our computer program given the data.

44:17

So it's basically running Bayes' rule on computer programs. We're all used to running computer programs in the forward direction. You take some inputs and you produce some outputs. But this is kind of doing it backwards. You have a computer program that takes some inputs, makes some calls to the random number generators and produces some outputs. These are random outputs. That's the data.

44:37

And now we say, well, what should the inputs and the calls to the random number generators have been to observe this output from the computer program? That's Bayes' rule on the program. And there are many languages now. Anglican is the one that Frank Wood's team has been developing, one of the state-of-the-art languages. Our group in Cambridge has a language called Turing, which is much less developed but also exciting.
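As a minimal illustration of running a program backwards, here is a sketch in plain Python rather than in Anglican or Turing. The coin-flip simulator, the function names and the choice of rejection sampling are all assumptions made for illustration. Rejection sampling is only the simplest kind of universal inference: run the simulator forwards many times and keep the hidden values from the runs that reproduce the observed data. Real systems use far more efficient algorithms.

```python
import random


def simulator(rng):
    """Forward program: draw a hidden coin bias, then flip the coin 10 times."""
    bias = rng.random()  # hidden variable with a uniform prior on [0, 1]
    heads = sum(rng.random() < bias for _ in range(10))  # observable output
    return bias, heads


def infer_by_rejection(observed_heads, num_runs=100_000, seed=0):
    """Keep the hidden values from forward runs whose output matches the data."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(num_runs):
        bias, heads = simulator(rng)
        if heads == observed_heads:
            accepted.append(bias)
    return accepted


# Condition on having observed 8 heads out of 10 flips.
posterior_samples = infer_by_rejection(observed_heads=8)
posterior_mean = sum(posterior_samples) / len(posterior_samples)
print(f"kept {len(posterior_samples)} runs, posterior mean bias ~ {posterior_mean:.3f}")
```

For this toy model the exact posterior is a Beta(9, 3) distribution with mean 0.75, so the sampled mean should land close to that value.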

45:08

It's based on Julia. There are many different languages developed by different groups and there are many different inference algorithms that can generally run on models in those languages. Here is, for example, a hidden Markov model written in Turing. It's fairly easy to read. If you uncomment one line of this model, you go from a regular hidden Markov model to a Bayesian hidden Markov model.

45:38

So changing models around is as easy as sort of adding and removing a few lines of your probabilistic program. And I really think that, you know, if our vision actually plays out, this could really revolutionise scientific modelling. If people were actually willing to write probabilistic programs for all of their models and they shared them, then people could take somebody else's model, run it on their data, improve it, etc.

46:05

A few resources here. I'll just give you a few examples. These are now slides from my postdoc, Hong. Okay. It's a little bit about Turing. I'll skip through that. That's our HMM example, but much bigger. This is a Bayesian neural network. Most of this is specifying the prior on the weights. And then this is the actual, you know, Bayesian neural network, that's just sort of the neural network function and so on.

46:32

And then you could just run inference using, you know, Hamiltonian Monte Carlo or something. You don't have to even know what that is. It abstracts away the model specification from the inference. And our language, Turing, is pretty competitive. It's sort of in the same ballpark as Anglican, occasionally a bit faster, but I know that the Anglican team keeps improving their language as well. Another topic I want to talk about is Bayesian optimisation. I have basically a couple of slides on that.

47:07

So the problem here is you want to find, ideally, a global optimum (maybe that's too much to ask) of some black-box function that is expensive to evaluate. So you can't just evaluate it in lots and lots of places. You need to think about where you're going to evaluate your function next. And we don't want to do that manually. We want to automate the algorithm that thinks about that. So the solution is to treat the problem as sequential decision making under uncertainty.
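A rough sketch of that loop, assuming NumPy, a Gaussian process with a squared-exponential kernel as the model of the unknown function, and an upper-confidence-bound rule for choosing the next point. None of these choices come from the talk; the objective, the kernel length scale and the `beta` parameter are hypothetical stand-ins.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=0.3):
    """Squared-exponential covariance between two sets of 1-D points."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)


def gp_posterior(x_train, y_train, x_query, noise=1e-4):
    """Posterior mean and standard deviation of a zero-mean GP at x_query."""
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_tq = rbf_kernel(x_train, x_query)
    solved = np.linalg.solve(k_tt, k_tq)  # K_tt^{-1} K_tq
    mean = solved.T @ y_train
    var = 1.0 - np.sum(k_tq * solved, axis=0)  # prior variance is 1
    return mean, np.sqrt(np.clip(var, 0.0, None))


def expensive_function(x):
    """Hypothetical stand-in for a costly black-box objective."""
    return np.sin(3.0 * x) + 0.5 * x


def bayes_opt(num_steps=15, beta=2.0):
    grid = np.linspace(0.0, 2.0, 101)  # candidate evaluation points
    x_obs = np.array([0.0, 2.0])  # two initial evaluations
    y_obs = expensive_function(x_obs)
    for _ in range(num_steps):
        mean, std = gp_posterior(x_obs, y_obs, grid)
        ucb = mean + beta * std  # optimistic guess at the function value
        x_next = grid[np.argmax(ucb)]  # decide where to evaluate next
        x_obs = np.append(x_obs, x_next)
        y_obs = np.append(y_obs, expensive_function(x_next))
    best = np.argmax(y_obs)
    return x_obs[best], y_obs[best]


best_x, best_y = bayes_opt()
print(f"best point found: x = {best_x:.2f}, f(x) = {best_y:.3f}")
```

The uncertainty term `std` is what makes this a decision problem: points far from any evaluation keep a high upper bound, so the loop explores them before settling near the best value it has seen.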

47:36

And what we're uncertain about is what the actual function is. And this has a huge number of applications. And I'm actually, you know, I'll say a couple of words about the automatic statistician, but I do want to leave some time for questions. So the automatic statistician is trying to automate model discovery. And the idea here is what we'd like is a system where we can just give it data.

48:00

It searches over a large space of models, evaluating models according to some principled metric that trades off model complexity with the amount of data that you have. And actually the marginal likelihood, which I described, is one such metric. It produces a model and then, interestingly, it translates that into a report that is then interpretable by a human being. So this is the opposite of a black box. We really want a transparent box, something that the human will be able to understand.
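To make the marginal likelihood trade-off concrete, here is a small sketch assuming NumPy. It is not the automatic statistician's search, which the talk does not spell out. The model space (polynomial regressions of degree 0 to 5), the Gaussian prior on the weights and the synthetic quadratic data are all assumptions chosen so that the evidence has a closed form.

```python
import numpy as np


def log_marginal_likelihood(x, y, degree, prior_var=1.0, noise_var=0.01):
    """Log evidence of Bayesian polynomial regression with a Gaussian weight prior.

    Integrating out the weights gives y ~ N(0, prior_var * Phi Phi^T + noise_var * I).
    """
    phi = np.vander(x, degree + 1, increasing=True)  # columns 1, x, ..., x^degree
    cov = prior_var * phi @ phi.T + noise_var * np.eye(len(x))
    _, logdet = np.linalg.slogdet(cov)
    quad = y @ np.linalg.solve(cov, y)
    return -0.5 * (quad + logdet + len(x) * np.log(2.0 * np.pi))


# Synthetic data from a quadratic with noise standard deviation 0.1.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 30)
y = 1.0 - 2.0 * x + 1.5 * x**2 + rng.normal(scale=0.1, size=x.size)

# Score every candidate model and pick the one with the highest evidence.
scores = {d: log_marginal_likelihood(x, y, d) for d in range(6)}
best_degree = max(scores, key=scores.get)
for d, s in scores.items():
    print(f"degree {d}: log evidence {s:.1f}")
print("selected degree:", best_degree)
```

Degrees 0 and 1 cannot explain the curvature, so their evidence is far lower. Degrees above 2 fit at least as well but spread their prior over more functions, so the evidence would typically favour the quadratic.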

48:33

Okay. And again, you know, I'll actually skip over most of this because I do want to leave time for questions. So we do a search over models. This is the automatic statistician applied to some time series. It finds a good model. Then it comes up with a description of that model. It produces the text itself. So this is the executive summary of the text. Actually, the text is in the form of these documents, which are, you know, 5 to 10 pages long.

49:01

And, you know, here is the report-writing demo. You know, we could run this, and this is a slightly different version of this, which actually does clustering. It tries to visualise things, it tells you what it's found, etc. Okay. And it tends to perform well at prediction because actually being systematic pays off. Okay. And we've applied this to classification as well, to regression, to clustering and so on.

49:36

And we're going to have a release of it. I keep saying very soon, but this time I really mean it: very soon means in a couple of months, I think. Okay. So I'm going to wrap up there. This probabilistic modelling framework isn't the only way to do machine learning, but it's a really useful organising principle. There are many layers, and it's completely compatible with the choice of models that you have and whether you like deep learning or even logic and other frameworks and so on.

50:09

We really can hybridise a lot of these methods to produce interesting systems that reason about uncertainty and learn from data. I've briefly reviewed three topics. There is a review paper I wrote a couple of years ago that summarised this line of work, and I wanted to end by thanking a whole bunch of collaborators I've had.

Transcript source: Provided by creator in RSS feed
