Probabilistic Inference and Learning with Stein’s Method

00:00

So the motivation from this work comes from large scale folks here in France. And to illustrate what I mean by that, I want you to consider this very simple example of logistic regression. So you have a data set with all data points. Each data point has a feature vector. I call it V.L. It's an Archimedes three dimensional feature vector and has a binary class label.

00:21

And the feature vector is related to the class label through an unknown parameter vector that governs the mean of of the class label. We'll call that parameter vector beta. Now, if you place a prior on that number vector, this becomes evasion, logistic regression model. And one day I want to highlight about this, about this genitive models that it's pretty simple to express. But even in this simple case, the posterior distribution over the unknown parameters is complex.

00:52

We don't know the normalisation, constant and exact, integrate and tractable. So what does one normally do? Well, one will often reach for Markov chain Monte Carlo CMC to eventually draw samples from the posterior distribution. And the benefit of this is that you can then approximate these intractable posterior expectations that are written in red. These integrals with asymptotically exact sample estimates, which are just averages over your sample plants and blue.

01:20

But there's also a drawback. And one drawback is that every new CMC sample point requires iterating through your entire observed dataset. And this can be pretty prohibitive when your dataset is too large. So motivating this work is how do we scale Markov chain Monte Carlo Osteria in France to massive datasets. And over the past decade, this template solution has been emerging, which is approximate CMC with subsect posteriors.

01:53

The idea here is that instead of running your exact Markov chain, you'll approximate it in a way that only uses a subset of your data to draw every new sample point. The benefit is that you have reduced computational overhead. So it means you have faster sampling and you can reduce your Monte Carlo variance quickly. But there's also a drawback, which is that it introduces this bias. That's awesome.

02:15

It doesn't go away. And the reason why is that your target distribution is no longer the stationary distribution of your chain. So you've that you sample forever would always be sampling from the wrong distribution. But the hope is that for the fixed amount of sampling time that you have in practise, the reduction in variance that you've introduced will outweigh the bias. And this will be lead to better inference.

02:38

So I think this is a promising approach, but it introduces a number of new challenges for inference. For instance, how do we compare and evaluate the samples that are being produced by these various approximate CMC procedures? How do we know how good they are? How do we select an approximate M.C., M.C. sampler and how do we tune the type of parameters? Because they're always hyper parameters that you need to tune. And then how do we quantify this bias, Ference Trade-Off that I just mentioned?

03:08

So there are standard procedures for CMC assessment, like effective sample size trace plots. Diagnostics. But they all assume eventual convergence to the target distribution. And they don't account for this additional Assam Toric bias that's inherent to these approximate CMC methods. So in this talk, I'll introduce to you some new quality measures that are suitable for comparing the quality of these approximate CMC samples. And a bit more generally.

03:42

We'd like to develop a measure that's suitable for comparing the quality of any two samples that are approximating a common target distribution. So here's here's our setup for the talk. We'll be focussing on continuous targets. P Target Distribution's P with support and all of our to the D. But you can relax that to any convex upset and we'll say that our targets have a density little P that's known up to normalisation,

04:07

but that integration under the distribution is intractable. So to approximate expectations under P., we're going to make make use of a sample. So we have sample points X1 through XM and. Together, these define a discrete distribution, which is just the empirical distribution that places one over end mass at each point. So we'll call it distribution cubin. And what we will do with Cunard's will approximate integrals under P. Expectation's under P. With sample averages under Q in blue.

04:41

And it's I should say at this point that I'm not going to make any particular assumption about the provenance of these sample points so they could be samples from your favourite approximate Markov chain, Monte Carlo algorithm, but they could also be deterministic points, for instance, produced by quadrature rule, or they could be iodines points from some auxiliary distribution. OK, so given the sample. Our goal is to quantify how well, sample expectations, approximate target expectations.

05:09

And we have a few criteria for doing this. First, we want are our quality measure, our measure. A way of quantifying. To detect when our sample sequence actually is converging to the target. And we also want this tact when the sample sequence is not conversion to the target. And we want to do all of this in a way that's computationally feasible so that you can use this quality measure to assess your Markov chains.

05:37

OK, well, given these goals, I think it's natural to consider the family of integral probability metrics because these are measures that quantify the maximum discrepancy between sample and target expectations. Overclass of real value test functions. Call that H. And when is classes sufficiently large convergence of your if your integral probability metric, I'll call it IPN for short. So convergence of your IPN to zero implies that your sample sequence is converging weekly to P,

06:08

and that's good because that satisfies our second requirement. Detection of non convergence. And there are lots of examples that you may be familiar with that fall into this, this family, the L1 battisti in distance the deadly metric, even total variation. But for for vast distances and to end the deadly metric there, there's a problem, there's a problem for our for our criteria. And the problem is that the definition of the integral probability metric involves integration under P.

06:39

And that's exactly what we don't know how to do. So most of these IP memes that you're familiar with can't actually be computed in practise. So how can we get around this? Well, one idea is that we could only consider functions that we know test function to age, that we know apriori are mean zero under P. If we could do that, then we'll get rid of this integration under P. And then the IPN computation only depends on our sample.

07:08

And that's potentially something that's tractable, something that we can compute. That's an interesting idea. But how do we find test functions that are mean zero under P. And. Once we find them, how do we know that the resulting IPN will actually track sample sequence convergence in the way we wanted to? And then finally, even once we've gotten rid of the expectation under P, we're still stuck with this infinite dimensional optimisation problem over test function.

07:38

So how do we solve that in practise? So come back to the third question in a in a bit. But first, I'd like to show you how we can use ideas from Psycho Stein's method to answer these first two questions. OK. Stein's method, I'm sure many of you are already familiar with this, but you know, Stein, Charles Sharfstein in the 1960s and 70s produced this recipe for controlling convergence and distribution.

08:06

And he did it to provide a central limit, to prove a central limit theorem and to provide rates of convergence for the central limit era. And I like to think of it as a three step approach. So in step one, you want to identify an operator, what's called a T. and a set of functions G that have a following property. When you pass functions from your set through your operator, the result is means zero under your target. So this operator generates means uro functions under P, which is great.

08:37

That's exactly what we want for our R improved IPN. OK, so say we've picked this operate and we pick the set, judges are going gonna be the domain of the operator together, TNG will define what will will define what we'll call a style discrepancy, which is an IBM type measure that has no explicit integration under P, because by design, all the test functions are means you're under P. So call it sign discrepancy S. And it's a function of your sample.

09:07

The operator you picked and the set you picked. Now, I'll come back to how we actually find such an operator in a moment, but first I want to show you what we'll do with it in Stine's one, what one would normally do with it. Stine's method. So the second step in science method is typically the lower bound. This time discrepancy by some reference IPN that you know and trust like a stay stand distance.

09:28

And the purpose there is to show that if you're STEINERT 70 goes to zero, then your your faithful IPN is also going to zero, which would mean that if your Stine's comes, he goes to zero. Then you know that your sample sequence is converging to P, and that's great. That solves our our second requirement detection of non conversions. Now, typically, this requires analysis, but it can be performed once at once in advance for large classes of target distributions.

09:56

And then you can go off and happily use your time discrepancy knowing that it controls convergence in this way. The third step in science, maths, that is typically the upper bound herstein discrepancy. By any means necessary to demonstrate convergence to zero. This is this would satisfy our second our first requirement. Now, this for us is probably the least interesting step, because in practise,

10:18

what we're going to do is actually compute the tiny discrepancy and see how big it is. But it would be nice to know that. Nice to know, Apriori, that if your sample was converting to P, your time discrepancy wouldn't d go to zero. So I'll give you some results of this flavour as well. Now, the standard used for this recipe is as an analytical tool to prove convergence and distribution, to approve central limit theorem. Just elif rates and the central limit theorems.

10:43

But our goal here is to develop it into a practical quality measure that you can use to assess, for instance, approximate M.S., M.S. So to do that, I'm going to have to give you a very specific instantiation of this otherwise abstract recipe.

10:59

So let's do that. Let's first pick one of these operators, we'll call it a Stein operator T. Remember, we want to find a T that generates mean zero functions under our target distribution P. Well, to do this, we're going to appeal to this beautiful idea due to Andrew Barr or called the generator method.

11:16

And the idea is that if you can find a Markov process that has P, is it stationary distribution and under very mild conditions, the infinitesimal generator of that process generates mean zero functions under P. Now, I've given you the definition of an infinitesimal generator here, but we don't actually need that definition. We're only going to focus on one specific instance which I've written in my my green box in the bottom.

11:41

So if we pick a particular Mark-Up process, this is the land of diffusion targeting P. Then I've written down the form of its generator and from the generator we're going to produce an operator. And here's the Stein operator. We're going to use a called T.P Land of Einstein operator. And the main thing I want to highlight what this operator is that it depends on the target, P. Only through the gradient of its log density.

12:06

So this means that this is an operator that could be computable, even if P can't be normalised, even if you don't know that normalisation constant. Those familiar with Stein's method might also see this as a multivariate generalisation of the density method operator due to Stiehm. OK. So that's going to be our working Stein operative for most of the talk. We need to know whether it produces mean zero functions. For that, we need to actually choose which input functions G to pass through it.

12:36

And it turns out that one suitable set is the set that have called the classical style set, which is just a set of functions that are bounded, that have bounded gradients and that have lipshitz gradients. If you pass any of these functions to your operator, the result will be mean zero, which is exactly what we want. So now we've picked our operator, T. And we've picked Stein, said G. Can I just ask what does the storm mean in front of the top of the No.

13:05

Oh, yeah, so sorry. So this is, you know, the status Perama tries by some norm, which could be whatever norm you want, your favourite norm. And then this normal star just represents the dual norm for that norm. But. Most of the talk. You should think of the norm as the Euclidian norm and then the the norm, Starr is also the Euclidian norm that's there. So you can basically ignore the star, but it's more generally the dual norm of the normal person.

13:30

Okay. Thank you. OK. So together, this choice of operator TI Incentive strategy produce a Stein discrepancy. For the purpose of a talk, I'll call that a classical style discrepancy. And now you need to know, does this time discrepancy actually detect convergent and non convergence in the way that we want? So what does that mean that means. Is it the case that this time discrepancy goes to zero?

14:00

If and only if your sample sequence converges to be. Now, in the univariate case, this is actually a well-known result for many targets. It's known that for many targets this time, discrepancy goes to zero. Only if the Bosher Stein distance goes to zero. Here are some references that provide results of that flavour. But in the multivariate case, relatively few targets have been analysed.

14:25

There's been substantial work on multivariate Gaussians. There's been some work on the Dirichlet distribution and a few other targets. But many targets haven't been assessed in the same way. So to extend the breadth of the literature, my collaborators and I proved the following result.

14:41

And it says that if the land of diffusion underlying the opera that we picked off, the land of diffusion of your target couples' an integral rate, that means basically that mixes quickly and great into your log density is Lipshitz. Then we have exactly what we want. You're Steinman's reference. He goes to zero. If and only if you're for sustained distance goes to zero. And so what are examples of distributions to satisfy these properties?

15:05

Well, any strongly law can cave distribution falls into this family, including the bagian logistic regression with calcium Pryors that I mentioned at the beginning of the talk. But it also covers non-striking often kape distributions like robust students to regression with calcium Pryors or even calcium mixtures which are multi-modal. Now, I should say, though, that this class is one class for which we can provide such a result. But these conditions are definitely not necessary.

15:29

But we hope that this sort of result will provide a template for bounding style discrepancies for other classes as well, so that we can expand the reach of this style discrepancy use. OK, so let me give you a simple example of using this time discrepancy in practise, the simplest possible example.

15:50

So here your target is just going to be a standard normal distribution and you're going to observe samples either from the standard normal or from a student's t distribution with matching mean and variance. So if you're assessing these two samples, what would you like to see? You'd like to see is that as my sample size grows and I'm getting samples from the I'm getting a sample from the correct distribution,

16:14

then my sign discrepancy goes to zero. And that's what you're seeing in this red line as a function of your sample size. You're seeing this style discrepancy value that was computed. And indeed, this is going to zero at one over root and rate. But now, if you receiving draws from the wrong distribution, you'd like your sign discrepancy to be bounded away from zero and that's what you're seeing this to line as I observe draw.

16:34

So my students t first a sign discrepancy decreases, but eventually it bottoms out because you're not getting a better approximation to the normal. OK. So that's something. I mean, we get the vet, we get the value back from a side discrepancy, but it turns out that on top of recovering the value, you can also recover the function that function G that optimise this time discrepancy, that worst case function. And I'm showing you those worst case functions here on the right.

17:00

I think even more interpretable than a worst case function G is the function after you've passed it through your Stein operator, because that becomes the test function. That should have been means zero under the under the target, the Gaussian. But isn't mean zero under your sample. And it's the thing that it's the function that has the worst case violation as a mean that's farthest from zero. So. It's I don't over interpret any of these examples.

17:26

I'll just give you a little indication of what you can read off from such a from such a function H. So on the bottom right corner here, you see the worst case test function for our students t sample. And what you see is that there's a lot of mass, a lot of large function values in the tails. And this is because the students t is oversampling the tails relative to a Gaussian.

17:47

And so that's where you can exploit the difference between your sample. That's we can most exploit the non gassy entity of your sample. OK. Here's a more interesting example. This is a posterior inference example, so our target now is going to be a posterior density formed by a prior product of a prior and a number of likely good terms from a from a from our EL datapoints.

18:16

To in this case, we're going to be assessing a particular algorithm called stochastic gradient legend dynamics, which is an approximate CMC algorithm. It it operates. It does a random walk through space. The way it does this is by starting at a certain point, taking a step in the gradient direction in the direction of the grade if your log density. But if your data set is large, computing grid Inverloch log density is expensive.

18:40

And so it approximates that using a mini batch to randomly subsample a set of your points and a stochastic gradient instead of a full gradient. Then add some Gaussian noise and that becomes the on the next step in your Markov chain. Now, you might see this as an approximation to a few other albums with what you might be familiar. One is called the Metropolis Adjusted Land of an Algorithm. So this approximates that in two ways.

19:05

One, instead of using the actual gradient of your log density, it uses this stochastic gradient. And two, it throws away the Metropolis Hastings correction. So the correction is usually employed to ensure that you're Markov chain is indeed targeting your target distribution, that it has the target as a stationary distribution. But computing the Metropolis Hastings correction is usually very slow for large datasets.

19:26

So this algorithm just throws away the Metropolis Hastings correction and hopes for the best. This is what makes it approximate CMC. OK. Well, one other we're doing approximate NCMEC, a number of things matter a lot. And one of those things is the step size. This is how large of a step you taken. Your stochastic gradient direction is Epsilon. If your step size is very small, then after a fixed amount of sampling time, you'll barely have moved from your starting point.

19:53

So that's slow mixing. So that's bad. That's going to give you a bad approximation to the target and the target I've drawn here in terms of it's drawing the equity density contours of the target distribution here for reference. On the other hand, if your step sizes too large, then you'll have fast mixing. But you'll be mixing very quickly to a very wrong distribution. Here you see that the sample is greatly over, dispersed relative to the target.

20:20

So this is a problem. You like to pick something in between. So how do you do that? Well, you could try to use a you would like to. OK. We like to pick this step size. So do we do that? You could try to use a standard selection criterion like effective sample size. This is commonly used to assess pilot change for Mark-Up, for CMC and to pick hyper parameters. But in this case, effective sample size is really just a measure of variance.

20:48

It's a measure of auto correlation amongst your sample points. So it will, in this case, always pick the sample that says here's here's my measure, effective sample size. Actually, I've assessed the effective sample size of different values of my of my step size parameter epsilon. And it basically just says, pick the biggest epsilon you can because the variance looks the best. So it says, pick this one. And the reason why it picks this one is that it's not.

21:11

It has no way of accounting for bias. It doesn't know that this is targeting the wrong distribution asymptotically. So it can account for the fact this is over dispersed. An alternative would be to use this time discrepancy that I just mentioned. And here I'm showing you the Stein discrepancy, values computed for each step size here, a lower is better. So the sign discrepancy says, OK. A very small epsilon is giving you a bad approximation.

21:35

A very large epsilon is giving you a bad approximation, but something in between is giving you the best. So pick this value and the value that it picks is this one in the middle, which you can see here, certainly from these plots, is like the best approximation to our target. OK, so one thing I didn't mention about the sign discrepancy that I just evaluated for you was how I computed it. And it turns out the way a computer does by solving a linear programme,

22:07

there's a particular programme that you saw that gives you back the value of the sign discrepancy. And that's great. But maybe you don't want to carry it. Maybe you want to solve a linear programme every time you have to evaluate your sample. You might find that cumbersome, too slow. And so now what I'm going to do is I'm going to identify an alternative, Stein said. That is a bit more user friendly, just easier to use than the one I just described.

22:29

And to do this, I'm going to appeal to reproducing Col's. And I want to build upon an idea of Chris Oates Girolami in Japan, who used means zero reproducing kernels to improve Monte Carlo integration. So just a quick background on reducing colonels. Colonel, do you think of it as a function with two arguments?

22:53

It's symmetric in those two arguments and it's positive semi definite meaning that if I took any points from my my sample space and I formed the matrix of pairwise evaluation to that function, then that matrix is always a positive, something definite matrix. So what are some examples? Probably the most common example is what's called the Gaussian kernel, which just looks like the W of a Gaussian distribution. And so that will be appearing in our talk in a second.

23:20

Just to give you a second example, here's an inverse multi quadrio kernel, which looks a lot like the Gaussian kernel of its decayed polynomial. Instead of squared exponentially. OK. What we actually want from this colonel is the space that it generates, every colonel generates a what's called a reproducing Colonel Hilbert space. And we're going to use that to choose a different style set to pass through our operator.

23:46

In particular, we're going to use this sign set, which some call the Colonel Stein said, and this is just the set of functions. So the function we've been using throughout the talk, we're actually vector value. They have de components. So this is a set of vector value functions such that every component belongs to your R.K. Yes, you're reproducing Colonel Hilbert space and the norms of those functions are jointly bounded by one.

24:09

So every function component belongs to RCA. Yes. And then you put all the norms together. They're about to buy one. That's what I said is. What's good about this set? Well, the very nice property that this set has is that this tiny discrepancy that he uses and that is you plug this into your Stiehm discrepancy. It has a closed form solution. And to form that solution. All you have to do is take your input, Colonel Kay. You pass it, you apply the Styne operator to each argument of Kay.

24:40

That gives you back a new kernel, which is what we call a Stein kernel. And then you sum up those Stein kernels across of every pair of sample points. So, again, the current this the resulting Colonel Stein discrepancy is just a pairwise, some of pairwise evaluations of this new Stein colonel that's formed by applying Herstein operator to your base. Colonel. So that's pretty cool. Clothes form does take quadratic time in your sample size.

25:10

But it's also very paralyser able so you can evaluate all of your pairwise evaluations in parallel on a cluster or across course. OK, so now we have another time discrepancy and then we have the same questions. Does it detect? Does it detection on convergence, for instance? Is it the case that your sample sequence converges to P. Whenever you're Stiehm, discrepancy converges to zero. Well, it turns out that in the one dimensional case. This is true and pretty broad generality.

25:41

If you pick any of your favourite kernels like that Gaussian kernel or the inverse multi quadrant, then if you're set, if your target distribution belongs to basically the same set of distributions that we consider before. Here I am. I've changed the name, but it's basically the same set as before. If you're if your target belongs to the script at P, which is the set of targets with Lipshitz grab log density and that satisfied distance, strong lock and cavity.

26:08

This is a property that's kind of like being strongly Loken K, but it only is only a measurement of the tail, so it doesn't have to hold locally. So even Gaussian mixtures that are multi-modal satisfy this property. If people longs to the set of target distributions and you pick your favourite Kernell favourite, we'll say universal kernel, then your sample sequence converges to P. Whenever your state discrepancy goes to zero.

26:32

So in terms of targets covered, it's all the same targets from before Bayesian logistic regression students t regression with calcium prioress, calcium mixtures, chemical variance, they'll fall into this class. And this also justifies the use of Casti with popular kernels like the Gaussian kernel. Another popular it is called the Materne Colonel and even the inverse multi quadrant that I mentioned before. But this result is a little bit limited because this is only for one dimensional targets.

26:57

And often when we're dealing with CMC, we're working with multidimensional targets. So what can we say in that case? Well, it turns out that in higher dimensions, many cases break down, they actually fail to detect an on convergence even when you're working with a very easy target like a Gaussian. So here's the result. It says if you're if you're in dimensioned three or greater and you're the colonel you pick has a very rapidly decaying tail.

27:26

So it's a Gaussian kernel if some a turn kernel which decays exponentially, if it's one of these inverse multi Quadro kernels that the case poorly normally. But at the polynomial, the K is too quick. Then I can give you a sample sequence that will send your sine discrepancy is zero but not converted to P. Which is actually really bad because that means you can't trust your style discrepancy.

27:48

And in fact, the sample sequence doesn't converge, the sample sequence that I can construct will converge to anything. It really just places a bunch of mass on the surface of a sphere and then shoots the radius of the sphere off to infinity. So your sample is not converting to anything. It's just shooting all the mass off to infinity. And yet your time discrepancy is still going to zero. I don't trust you with this. So when you say not converging, so this is converging in distribution, right?

28:16

Yeah, that's right. Secondary in distribution to anyway. That's right. There's also a convergence on compact sets, of course. Yeah. So, I mean, the real issue here is that you're sending all of your mass to infinity. All of it. And so it's essentially converting to the zero measure. Yeah. So. So in that sense I suppose. And they converge at the foot, converge because on any compact that you're left is nothing.

28:42

Yes, but not to any problem. We won't convert any probability distribution. That's true. Yes. And that's that's, I think the most part about. But yes, it will convert. It is converting to the zero measure, as in convergence. But that is OK. I think this is the problem. That is the whole problem, actually. That is the whole problem. First time currently to work in the way you want it to.

29:00

You have to be able to distinguish the zero measure from Pete because the zero measure also has means zero functions. And so to speak. Yet to build a Singlish, those two things. And the problem is that if you're Col's decaying too quickly, the these very light tails are ignoring that excess mass that you're sending into the tails. And so it doesn't notice that you're shooting all the mass off to infinity. Essentially, you're not controlling type this. So that's a problem. So what can we do?

29:32

Well, I just said, well, the problem seems to be these like tales they're ignoring massive, if any. What if we use heavier tails? And it turns out that that is enough to solve the problem. If you use one of these inverse multi Khwaja kernels, these Pawling O'Meally, the cane kernels, but you have a cane not too slowly if you choose them. If you choose the decayed parameter between to be between minus one and zero, then everything works.

29:55

If you're if you're distribution belongs to our standard class and you use this. I am to Colonel. Then indeed if you're Stiehm discrepancy goes to zero, your sample sequence is conversion weekly to P. In any dimension. OK, so let's go. We have a solution. Now, what about the other side? What about detecting convergence? We want to show that if my sample sequence is converting to P, then my sine discrepancy is also going to zero. Well, turns out that this is true and pretty broad generality.

30:27

If your kernel is bounded and twice continuously differentiable and your grab Luppi is Lipshitz and Square enterable under P. So that's a pretty mild conditions. Then your stand discrepancy will go to zero whenever the Bosher stand distance goes to zero. So if you're Cubans converting to be embarrassed in distance, your tiny discrepancy will also converge to zero. And this covers all of your favourite kernels in them in any dimension. OK, so let me give you an example of using this.

31:01

This kernel style discrepancy, now we're going to consider another one of these approximate CMC procedures. This was designed for scalability. This one is called stochastic gradient fissure scoring, or STF s for short. And in the initial paper, there are two different variants of this algorithm proposed. One that is more of an exact variant that involves inverting a DVD matrix. Another that tries to reduce computational effort by converting into agonal matrix instead.

31:28

We're going to operate on a fifty one dimensional problem in this case, using a post here distribution over parameter Bayesian logistic regression posterior using monist handwritten digit images. OK, so here's the result, basically, we're comparing two different. I said we're comparing two different versions of the CMC algorithm approximate I'm CMC algorithm and teal and red. If you apply the same discrepancy, you see that across sample sizes.

31:59

The full version, the F version just seems to be better, has a lower and a lower style discrepancy value. So this would suggest that you should always use that unless the speed up from using the diagonal version as much as very large. But it turns out that the speed up from using the diagonal version is quite mild. No differences like point o one seven seconds versus Ponto and nine seconds. Not much of a gain.

32:21

So the quality this would suggest that the quality that you're losing by using the diagonal version is not worth it. Not the speed up you're getting is not worth the quality being lost. But that's just the output of this discrepancy measure.

32:33

You might want to might wonder, what does that have to do with reality? So to give you an external cheque on this, Amun, this assessment, what I've done is I've run a very long and trustworthy Markov chain to generate a ground truth sample from the posterior. In this case. I believe it's either Hamiltonian Monte Carlo or in Metropolis suggested it led to an algorithm. So this is a Markov chain that actually is targeted at the distribution, run it for a very long time.

33:01

And I've used that to as a surrogate surrogate ground truth for the true posterior. And then I'm comparing the meat, the produced means and co variances from our approximate CMC algorithms. So in the left column here, I'm showing you the the the best meaningful variance approximation across pairwise marginals and in the right column showing you the worst. So on the top, we have the diagonal version of the algorithm and at the bottom we have the full version of the algorithm.

33:30

What you basically see is that even in the best case, the diagonal version is doing a poor job of approximating pairwise marginal variances. And even in the worst case, the full version of the album is doing a great job at approximating equal variances. So it's just an external cheque on the on the quality of these to approximate. CMC albums. And indeed, just by looking at pairwise marginals, you can see that the full version is doing much better than diagonal version.

33:55

The benefit, though, of using this time discrepancy over what I've just done on the right is that you don't need this extra auxillary surrogate ground truth chain to perform your assessment. OK. So so far, I focussed wholly on sample quality comparison. And I like to highlight a few other application of these tick through the application of these techniques.

34:22

So as an example, a few groups have proposed using Colonel Stein as references like the one I just described, to perform goodness of fit testing, in particular to Will Koski at all use the Acasti with a Gaussian kernel to carry out a variety of goodness effect testing experiments. The authors of this call. All right, Tom. Hey, Arthur. Arthur is part of that, too. Arthur Gretton is a great idea.

34:51

And one thing they found was that the if you use a default Gaussian kernel, you could experience a considerable loss of power as your dimensioned D increases. And, you know, I think this is in part related to the problem that we highlighted before with a Gaussian kernel being unable to detect, unable distinguish the target from the zero measure in higher dimensions. So what we're going to do is we're just going to repeat their experiment, plugging in the I am Q kernel for the Gaussian kernel.

35:18

OK. So here it is in the setup. Your target P is a multivariate standard Gaussian. You're seeing a sample that's not Gaussian. And you want to test whether it is Gaussian or not. And specifically, a sample is formed by spiking one of the components with a a uniform random variable. And so you want to test the power of this? We want to compute the power of this test for the testing, this departure from normality.

35:46

And they also compare with the standard normal normality test of bearing House and Enza. So what do we see? Well, we see that in low dimensions, the Gaussian. All these tests, the Gaussian KSTU, the Barenholtz and hence the test. They all do quite well. They have full power at detecting these departures. But as a dimension increases, you see that the power of the Gaussian test decays very rapidly.

36:05

By the time you're in dimensioned twenty five, you've lost all of your power. But if you just substitute this, I am. Q Colonel with polynomial, the k the one that we describe that controls convergence for the Gaussian kernel. You actually maintain full power across the range of dimensions. OK. So another thought is that, OK, we've we've looked at. We're assessing sample quality. What if we could we also improve the quality of our sample? You've you've generally a sample somehow.

36:34

I wanna produce a better approximation to my target. Well, one way of doing this might be to assign a weight to each of your sample points and then optimise those weights to minimise your Col's time discrepancy. This is the idea. And a paper by Lou and Lee in twenty sixteen. And they again do this with a Gaussian kernel. And we're going to show you some benefits from doing this with an IQ kernel instead.

36:59

So what I'm showing you here is the main square area between the true means and the estimated means, using a sample, using a sample, a weighted sample or original equal weight sample. So your initial sample is just an I.D. draw from your target. Then we're going to learn a re waiting using a Gaussian. I'm q sorry, I'm guessing KSTU guessing style discrepancy. If you do that, you see uniform improvement in an approximation quality across dimensions from dimension two up to one hundred.

37:29

And then if you use an Q colonel and said you see yet further improvements, approximation, quality. OK. So you might. So in this last example, we were improving sample quality by re weighting points. You might also imagine that you could improve sample quality by shifting locations of your points. You start with the initial sample and then you move those points around to improve the quality. And this is the idea of Stein variation of gradient descent. You know, Leo and Wang.

38:04

Essentially, it uses the colonel's time discrepancy directions like the worst case test functions from Colonel Stein's Pepsi problem to update locations of points. And it turns out that you can also view this update as a as a as an approximate gradient step in the Khail divergence between your sample approximation and your target distribution. Now, the conversion, the organ is not completely clear, but it's a very it's very simple to implement and seems to work well.

38:32

One other possible downside is that the update also costs N squared time, so to update your set of sample points, if you're updating all of your points on every step. And that takes anywhere time per update. Another approach and other related approaches. Use your style discrepancy to greedily select the next points. So start with a start with no approximation. Pick the best single point approximation to your target, a Winterstein discrepancy.

39:01

And then repeatedly add in additional points to improve that approximation. That's another that's another way of selecting Denovo points for proximity. Your target distribution. Different from just sampling or using CMC. And you know, we can show that this actually sends your case easy to cast you to zero at a particular rate that a square log end over end rate.

39:23

So you can prove convergence of this algorithm. One possible downside is that to pick each new point, you have to solve this optimisation problem. You have to. You have to solve this to pick the next point, X, X and you this all this optimisation problem over to the T. Which could be expensive or difficult. So we proposed a modification which replaces that expensive optimisation with just sampling from a short Markoff chain.

39:48

And it turns out if you have a Markov chain that's mixing quickly anyway, instead of optimising over all the space, you can just generate a short Markov chain. Pick the best point from the iterates of that chain. And use that as your next. The next point in your representation. And then it maintains the same rate of convergence. So here's an example, what those sample points could look like if you're doing em CMC. Typically the problem, the the lack of approximation quality comes from clumping.

40:20

Coupling it with clumping that comes from randomness. So you'll have to see places where you have your oversampling in places and also large gaps just due to randomness. Whereas the Stiehm Points CMC algorithm typically spaces out your points to minimise redundancy and they give you a more compact representation. So I'm showing you some examples of this algorithm against some alternatives.

40:50

This one is from a kinetic model of automatic control. So this is a 10 dimensional posterior distribution from the parameters that involve solving. So the likelihood here involves solving an O.D. And so you'd like to minimise the number of those likelihood evaluations. And so what I'm showing you is as a function of the number of about likelihood evaluations, how good is my approximation to the target distribution using a variety of different particle generation procedures?

41:21

So comparing Markov chain Monte Carlo algorithms to S PGD appear in red to this nine points. I'm CMC algorithm, which are these solid purple and blue line to the bottom. So. The dash lines are NCMEC. Yes, these dash lines are MSNBC. Red is as PGD. Essentially, what you're seeing was having with FSU disease because every update ticks end square. Time it progresses slow. Eventually, it's going to converge to a good solution, but it just takes much longer than these other procedures.

41:54

CMC does real relatively well, but you can actually get to better quality more quickly using these Stiehm Point M CMC procedures. Here is another example from a Financial Times series. I'll just skip over that. OK, so I've highlighted a few applications and some of the technology that's useful. I wanted to end with some opportunities for future development areas where I feel like more work is still needed.

42:24

So first, improving scalability of style discrepencies while maintaining convergence control. So this tiny discrepancy itself involves this great into your log density where the gradient, your low density is too expensive. Can you still monitor convergence in a way that's both computationally feasible and theoretically sound?

42:46

My collaborators, I have started doing some investigation into this. We have some new work this year called Stochastics Stein Discrepencies, which essentially just takes you replaces the GRAP great Inverloch entity with a random stochastic gradient. And we show that if you do this, you can still control convergence with probability one. So that's pretty exciting. But I think there's still much more to be explored in that space.

43:12

If you're using one of these Col Stein discrepancies, I mentioned that the expense is quadratic in your sample size and that might be to be prohibitive. Your sample size is quite large and there's been a good bit of work on this now and trying to improve that. So Arthur and Collaborator's have work on what are called finite sets time discrepancies, which essentially are using low rank kernels and provide linear runtime.

43:36

But there are still questions around the convergence control of these of these discrepancy measures. With Jonathan Huggins, I've worked on a different type of style discrepancy, some of which are actually, Colonel, some discrepancy, some are related, but they're actually they use different function classes. They're called the random feature style discrepancies. And you can think of these as providing essentially a stochastic lowering kernels that have near linear runtime,

43:57

but also provide high probability conversions control. But there there's still some caveats to establish that convergence control. We needed to assume something about the moments of your approximating approximating samples that to be uniformly bounded. And so it'd be nice to get away from that requirement and just have something that works all the time. Now, I also proposed this particular operator to you that came from the Lensman diffusion call called the lens of an operator,

44:24

but actually an infinite number of operators characterise the target distribution. So how do we know which one we should pick? What is the best operator and how does that impact your discrepancy? And what's practical? These are all open questions. But I can tell you, I can give you example why this matters. If I give you a class of distribution's for which the same discrepancy definitely does control convergence,

44:46

but here's an example of distributions for which it does not. If you're dealing with, say, a student's t distribution. So the grading your love density is bounded and you're using any kernel, any C0 kernel like the Gaussian kernel or the item. Q Colonel Kernel, anything that's standard is zero. Then you're Stiehm discrepancy will not detect an unconverted, will not control convergence.

45:06

In particular, I can always produce a sample that will stand your sign as revenue to zero, even though the sample is not converting to P. That's bad. It's the same sort of problem that we saw in load in low dimensions for certain kernels. But now this is happening for all the cartels. All your favourite colonels and every dimension.

45:29

So what can we do about this? Well, my some of my collaborators and I like Andrew Dunkin's, Vashion, Volmer, Jackson, Gore, and have been investigating alternative operators called the Fusion site operators.

45:40

This is essentially the idea here is that instead of using Relenza and diffusion, use a different diffusion and you can pick the diffusion in a way with that counteracts the decay or lack of growth of your of the grab bag density so that you can still detect convergence even if you're dealing with a heavy tail distribution. So that might be one solution, but I think there's still lots of open questions there.

46:06

Some groups have been using these standards committee ideas to train generative models like generative adversarial networks and variation of auto coders so that you can generate new instances of of different classes of of different data distribution. So in this case, it's generating new celebrity faces. You can also use these ideas to develop algorithms with with provable, finite sample guarantees for optimisation.

46:40

Even Vernon convex optimisation. And the basic idea here is that if you're going if you want to minimise a particular function, f you're going to do this by trying to sample from what's called a gibs distributions. You're going to take a target distribution. Make it E to the minus F or me to eat to the minus F type some scalar. And you say, like, can I sample from a distribution. And if you can simplify my distribution and you can increase the size of that scalar at the right rate,

47:08

then you can eventually end up minimising that function. And we found that for fun, even for a class of functions. They're not convicts. But satisfy this other property called distant deceptively. We can actually design Markov chains that will minimise the function with explicit rate of convergence.

47:33

And here's an example of one very simple case of such a function, like here's a function that is not convex, but it's really just because instead of growing convexity from the tails, it actually has concave growth away from the tails. And so if you use an algorithm like gradient descent, it has very little signal about where the optimum is. Even though you seems clear from this picture of the Optimus as you just read zero. Eventually you'll get there.

47:55

But if you do in gradient descent, you're going to proceed actually very slowly. And this black line is showing that we're taking. It takes green and sent about 10000 thousand iterations to get close to the optimum here. If you use the land of an algorithm to try to sample from the gibs, distribution will enter and I'll go and actually go off in the wrong direction because it has poor mixing for distributions like this for for distributions with heavy tails like this.

48:17

But if you take our design diffusion, which is based on changing the grand log density, scaling up the grab low density in a particular way, then you converge to the optimum like fifteen iterations. OK, and there are a number of other tests which people have been exploring using these tools, parameter estimation and models where maximum likelihood is intractable.

48:44

His reference for that thinning the output of your Markov chain in a way that to get a more compressed representation of a posterior distribution is a reference for that, too. And I'll leave you with a slide of future directions and open questions that we have something to talk about. So thank you for your time.

Transcript source: Provided by creator in RSS feed: download file

Episode description

Transcript