Complexity of local MCMC methods for high-dimensional model selection

00:04

OK, so let's start. So welcome, everyone, for this Friday's seminar talk. So this is our second psychoanalyse elsewhere Satchmo seminar this term. So the speaker today is Trengove from Texas A&M University. So I try back in 2008 when he was postdoc at Rice University and he visited University of Toronto, my supervisor, Jeff Rozental, and then starting to start starting from that time we worked together.

00:40

So now he's moved to I think two years ago, he moved to Texas A&M University and now he's assistant professor there. So our collaboration is very incredible and other successful, I would say. So I highly recommend you talk to him if you have some common interests with him. So he worked. He has a very broad research interests other than Bayesian methodology and the computation we will hear today.

01:11

We also work on statistical genetics, graphical models, stochastic control and experimental design. So today he will talk about complexity of a local CMC search for high dimensional model selection. I believe it includes some of our work, also also his own work with his student at Texas. So I will leave to join very much prior to the introduction. Thank you so much for having me here. It's it's really a huge pleasure, a huge honour for me to speak about my research and to this seminar.

01:49

So today I would like to talk about some of my written books on local and Simsim methods for high dimensional models and action. And actually, the talk is mostly based on my and the joint work with Restring and Gareth Roberts, Jeffrey Rosen and vets and the results are based on a joint a joint work with my students. And and I will also mention one results. Well, which is from which comes from an ongoing project.

02:25

So before I start my talk, I would like to share with you guys a story behind the birth of MSNBC. Perhaps a lot of you already know this. So the Metrobus algorithm was first implemented by Dr. Rosenbluth, the woman in this picture in nineteen fifty three, Dr. Rosenbluth. She is actually from Texas. And then and she received her Ph.D. at the age of twenty two.

02:54

And shortly after that, she implemented a Metropolis algorithm. And at that time, well, of course you call it everything in machine language. But after she implemented major obvious algorithm, she left academia and probably never worked on research again. And in two thousand and three. So some way interview with her about about this very first implementation of this algorithm. And she said that she was very surprised that someone still even remembers this algorithm.

03:28

Well, but as as we all know, well, not only do we do we still remember this algorithm, but actually we have been working extensively on the theory and the methodology of seeing the past, that maybe 30 years. Unfortunately, Dr. Rosenblum has passed away at the end of the last year. And this is the obituary posted on the IMC website. And you can find the complete story of her life in New York Times. And I think it's very inspiring, very amazing, very unique story.

04:01

Well, anyway, let's start my talk and let's first quickly review what is the nature of the algorithm. So in this talk will be mostly concerned with a financial stake space, so I'm just going to use this big arse to denote an arbitrary final state of space, and I will use PI to denote our target distribution on this space.

04:26

Let's assume the support of PI is just given by us, so given any transition matrix, we can construct another churnalism Matrix P according to this formula and we all know that, well, just by construction, this new transition matrix is reversible, we suspect. So if further irreducible and aperiodic, then this macro train will converge to PI and if we can simulate a dismal chain, then the samples are generated from this Markov chain can be used to approximate the targeted distribution PI.

05:02

And the and the William Cohen observation here is that to to this market train, we only needed to know and and normalise the of part. And then we can say this P well, defines its algorithm for somebody for one part and the Ferguson Hastings' algorithm. Well, we say this Matrix K is the proposal and this minimum term is called the acceptance probability. The meaning is pretty straightforward.

05:30

So if we want a similar guess, just P then we just well, in each iteration propose a new state according to the proposal distribution, and then we decide whether to accept it or not according to the acceptance probability. And metropolises are reasons and more generally speaking, and Simsim methods have been widely used in the statistics because in statistics very often we are faced with very complicated posterior distributions whose normalising constant is not trackable.

06:06

And henceforth I would I would just refer to Piasta posterior distribution. And in this talk, we are mostly interested in a class of problems called motor selection, and perhaps this is the most important class of statistical problems defined on its creed status basis. And for these problems, the size of the state of space? S typically depends on a parameter P, which is just the number of variables or the number of features we have in our dataset.

06:40

Well, when one example is just the variable selection for verbal selection, the state of space, as is the collection of order of all the binary sequences with less equal to the size of as of these two to the. And another example is a structural problem where we want to identify, well, the so-called thag directly the acyclic wuv underlying our multivariate data.

07:09

And for these problems, the state of space, as is the collection of all the decks with pianos and then as the size of us grows super exponentially with Pete. Well, we will be mostly interested in high dimensional settings where he is much larger than the sample size. So we have to impose some specific constraint so that the model is identifiable. But even if we take into account the specific constraint, in most cases, the size of the size of us still grows super polynomial with.

07:46

So of course, we may wonder, well, then then are the algorithms efficient enough? For these emotional problems, because theoretically speaking, we can always say compare that with approximately messers like bioregional bias or maybe the ABC method and Simms's asymptotically exact, because if we can generate infinitely many samples from MSNBC, we are able to exactly recover the posterior.

08:15

But of course, in reality, especially for emotional problems, we really wonder how fast can these MSNBC algorithms converge? If we if it cannot converge in a reasonable amount of time, then it's probably not that useful. And the this is a very useful property that are very open, we want to verify for our algorithm is called racketed mixing, we say, and the algorithm is rapidly mixing if it's mixing time growth only polynomial with the complexity parameters and the.

08:54

Because because the state of space can grow super polynomial, it was so actually the mixing is a very informative property. It tells us that, well, our algorithm, the complexity of our algorithm grows much more slowly compared to the problem size. And in recent years, there a massive the the local inform that MSNBC has become pretty popular and to explain what is locally informed, MSNBC.

09:26

Well, let's first introduce some not notation. So I'm going to use this an axe to denote the so called neighbourhood of the point X, and it is just a collection of all the points that I can propose given the current point X. And now in practise, I would say that for almost every case algorithm to find out discrete status based, this neighbourhood only grows the size of the neighbourhood,

09:58

only grows polynomial with P and understand the way well to implement and which is to use the so-called random book proposal, which just means that, well, given the neighbourhood, I'm going to propose one neighbouring states randomly with equal probability so we can write K as this one of course is to indicate a function. For Mitchell Hastings, algorithms on final status basis, actually, for any market trends, on final status basis, we can describe the market watching as as a graph.

10:37

And here in this picture, I'm using these black thought to represent the state of my state space. And if two states are neighbours and let's connect them with the law. So if we look at this state X zero, then this X zero has six neighbours, excellent. X two, blah, blah, until X six. And these blue bars indicate well how large the posterior probability is at that point.

11:05

Now, if we use the random wug well, which the Hastings unwisdom and at the point, I'm just going to randomly draw one neighbour amongst these six states randomly with equal probability. And for this distribution, we'll probably we will say that at the point zero, maybe we want to move to X four because the post here mass looks pretty large on the point X four and the posterior mass on the other five points.

11:35

I mean, extra, extra, extra five, six are actually smaller than the probability of X zero. So intuitively speaking, we want to say the best move seems to be from zero to four and the probability of this move is just one six because we are using the random of proposals.

11:58

So it means that, well, on average, we probably need maybe say, around the six iterations to move from zero to four and it is Ristic reasoning also tells us that the mixing time of of the mark of the mixing time of this of this metropolis has this algorithm random its algorithm on this final stage space needs to be at least as large as the neighbourhood size.

12:31

Because because we need to say approximately, for example, six iterations here to move from zero to export or and I am sorry, I forgot to to to to to define what is the mix and how I would define that later. But roughly speaking, makes in time just a process how many iterations we need for the chain to reach station the.

12:54

And then the idea of the law calling for proposals is that, well, since we all things we actually can tell that this X four is is is a much better move than all the all the other five neighbouring states, then why not, let's say, just to propose X four with a of probability. And this is kind of the motivation behind the so-called locally informed proposals. So given the neighbourhood, let's turn our proposal probabilities.

13:25

Let's evaluate the political landscape pie in the the neighbourhood, and then let's propose those states with larger posterior probabilities. So if at so point zero, for this example, we'll probably propose X for the next step if we use locally informed proposals. So here is the mathematical definition of the local informed populace. Let's use F to denote any arbitrary, non-active function and then let's define a new proposal matrix. According to this formula.

14:03

This indicator, function one an just means that we are still considering the same neighbourhood as the previous Random House Hastings. And according to this formula, we are going to propose a neighbouring state, why? With probability proportional to PI over Pontiac's. And that because we have introduced this proposal weights, so we need to. So we need to calculate is normalising constant Zak's on the denominator.

14:35

So this is a novel. So the X is the normalising constant for our informed proposal distribution's. And according to our previous heuristic arguments, we would say that, of course, we want to choose some function that is now decreasing because we want to assign larger proposal ways to those states with larger posterior probabilities.

15:04

But one main disadvantage of this local informed scheme is that so to implement this, inform the proposals, we have to evaluate the posterior landscape pie in the current neighbourhood. We have to evaluate the pie for every Y in the current neighbourhood.

15:25

So that we will be able to calculate this number, I think, consent and then two, and then we can exactly calculate the probabilities, but of course this can take a lot of time, although the neighbour who decides, well, it's it's much smaller compared to the compared to the whole state of space. But if it's really large, then then this local evaluation might take a lot of time.

15:51

And although this well, and I want to say that although this locally informed framework was just recently recently proposed by senator in a paper in 2020, but actually this idea of locally evaluating pie in the current neighbourhood has been widely used in many other measures, and it has also been used in other Nullum same methods. So for all these matters that rely on this idea of locally American Pie, OK, we can ask the following question.

16:31

Can these methods achieve a sufficiently fast the convergence rate that can offset the cost of computing? This nominating council is in every iteration and to the best of our knowledge question has never been rigorously addressed for high dimensional statistical problem. And this is what we are going to do in today's talk. We are going to answer this question for the variable selection part.

17:02

And before I talk about before I introduce our algorithm and and talk about how we do that through the proof or the mixing time, so I would like to first offer some general intuition about this problem. So to be able to analyse the convergence rates of an algorithm for models, model, selection problem.

17:25

We want to we want to know something about about this poster distribution pie, we want to be able to say something about how this pie looks like because the state of space goes super, super enomoto with a very low vitacost exponential rate or even super exponentially with people. So, of course, I mean, we don't expect revenue mixing can can hold it for any for any pie. Right. We have to be able to say something about the pie so that it is possible to prove say something like Rachna mixing.

17:59

And the one thing that is very natural to assume from a statistical point of view is the so-called strong selection consistency for basic model selection procedure. We say that it has strong selection consistency. If for some point X star we have the posterior probability of X Star goes to one in probability with respect to the true data generating process. And of course, this star can be interpreted as the Truman.

18:32

So if we think about the classical asymptotic regime where is fixed and it goes to infinity. Well, such kind of consistent consistency results are very expensive and in the high dimensional settings of things that are actually pretty similar. So all we need, all we need, all we need is that the sample size is not too small compared to P, then we should have such kind of consistency of results. We should be able to recover the true model from the data.

19:03

And if the posterior distribution really concentrates a single best model X star, then according to the Michael theory, we actually know that the mixing time of the enzymes is just equivalent to the hitting time of this Baska model. And intuitively, this seems to be pretty reasonable. And actually it makes a lot of practical sense as well, because if PI concentrates on one that's the model, then, you know, our I guess our goal is just to to to to identify the single best model.

19:35

So hard to prove this strong selection, consistency. So one way is to show that this poster, this distribution pie is using the model and the unique mode is sufficient, right? Shock. And well, I guess that I know that well, probably a lot of you have had some questions about this, about this consistency and this Animoto condition, and I will return to this unit model condition in the last section of my talk.

20:06

But for the moment, I just want to make one remark that is, even if we can prove, even if we can show the poster distribution is your model on the model space, we actually cannot really say that much about how the posterior distribution looks like. So let me explain this point of using some examples. So this is Combivir, the normal distribution with two independent coordinates. And here I only plot this distribution in the first quadrant.

20:39

And it's well, I mean, this is a pretty perfect unit model distribution and why we're seeing unit model distributions. I guess a lot of us are thinking about distributions that look like this. Now, let's consider another example. OK, so so not let the stage space as well, just be the collection of nine points, one, two, three times, one, two, three. And then let's consider this post distribution point.

21:12

And the for this state of space, well, we will just use the most natural network with a relation that is, I will say, two states, two points, our neighbours, if the Euclidean distance between the two points is equal to one. Then using this neighbour who is a relation and if we look at this well, this political landscape, then we actually can tell that this distribution is also model, although it may look a little strange, but actually it is unemotional.

21:45

So we may let's say we first look at this point one one, then the point one has two neighbours, one, two and a two one one. One is not a local model because the point of two, one has a larger probability. The point of two, one is not a local border either, because the point three, one has a larger posterior probability.

22:06

And actually, if we simulated this algorithm for somebody from this poster distribution, then and let's say we we initialise the chain at the point with one that we probably will see that the chain can move along this black pass. It will move from one one to two one and then three one. And then eventually it will reach the mode at the point when three.

22:34

So this so this distribution is also you anymore, but it looks very different from the previous by the normal example and of course we can make it continuous and if we make this distribution continuous, then we will have something that looks like this is very different from from this perfect union model distribution.

22:55

But this is still unavoidable. This is what I mean by why we're saying that even if we can say the Posetti distribution model, we we don't really know that much about about the shape of the distribution.

23:09

It can still be pretty irregular. And although you may say, well, this example looks pretty artificial, but actually I would say for model structural problems like verbal selection, this example is really, really describes this example really gives a pretty typical scenario that we will encounter, because the variables we have high dimensional setting, they are correlated and at the correlation structure amongst these variables can be pretty complicated.

23:41

And I want to make this point because in the literature, very often we we see that, well, people justify why they are always as are useful by considering, say, a high dimensional setting and by assuming that the distribution has independent it coordinates. But I want to say that well, I mean, of course, those those analysis pretty useful. It provides us with a lot of insights into the behaviour of the algorithm. But it's an assumption like an independent coin.

24:15

This is not really realistic in the high dimensional setting, and it is something that we probably can never prove using a statistical theory. And on the contrary, this ultimate assumption is something that we can prove is something we can justify according to the statistical theory, because all we need to require is that the sample size be sufficiently large.

24:39

And the point is that although the point is that although the unipolar distribution may start to be restrictive at the first glance, but actually the possible distribution can still look pretty complicated, pretty irregular, and in that kind of give rise to a lot of difficulties in the theoretical analysis of of our relations on this post here for these posteriorly solutions. And next, well, we will just focus on the verbal seduction problem.

25:12

And now we'll show you what ours and we can propose for the rebels actual problem. And and that will show that our algorithm can achieve a surprisingly fast mixing rate. So let's just consider the standard response. Linear regression model Wilkos to explain that. Plus, Absol, where abstinence represents the normal errors and acts is the bipartisan matrix, where the sample size is the number of variables.

25:40

And because we are interested in the high dimensional settings where is much greater than and so we will impose a specific constraint. So we assume that most entries of batur are zero or at least the most edges of data are negligible. And therefore this verbal seduction problem, I'm going to use Gamma to denote a state of our state of space. And this is the index set of the now zero entries of Batur. In other words, Gamma is the set of variables that have now negligible effects.

26:12

And the goal of verbal selection, of course, is to identify is to learn this gamma from the observed data. And I will use this up as zero to download our model space for this for this unbearable selection. OK, so as zero is the collection of all the subsets of on the integers, one, two, P with size less than or equal to zero. So this is the sparsity parameter and it may grow with it may grow as well.

26:45

Or we can say it with P and actually we, we want to assume this as zero goes to infinity and goes to infinity. So this is the moral space for our crime problem and therefore the verbal seduction problem, I guess we all know that the most classical solution, which is called the ad swap sampler. So for every state gammer in our safe space, let's define three different neighbourhoods and let me use another now and swap to denote these three neighbourhoods.

27:26

So that, of course, it is the collection of all the neighbouring states that that we can move to by adding one covariate to the current model. And that is this is the space that we can move to by removing one Kovarik. And the swap is the space that we can move to by swapping one covariate in the current model with one kolaric that is not in the current model.

27:52

And and, of course, we can track that well, the size of the the size of the additions and the deletion neighbourhood is just equal to P and the size of the small neighbourhood is given by this. And they're using this as a data swap moves, then we can define a random walk Arabism, for example, here we can let's consider this proposal scheme, which probably would have.

28:23

Let's just let's propose one state was probably the one that's just randomly proposed one addition or the move, which probably was probably the one that's proposed one swap move with equal probably. And in this case, this altruism is also coded, symmetric, rather more hastings', because you can track the proposal primarily from Gammer to Gamoran is always equal to the proposal, probably from Gamoran together.

28:54

Although this it looks pretty naive, but I want to say that because of its simplicity, it's actually wider use amongst practitioners, it is pretty easy, pretty straightforward to implement. And the idea behind it is also easy to understand.

29:10

And in a seminal paper published in 2009, 16 Annals of Statistics by Michael Jordan and Malcolm Wainwright, it is shown that in the mile high dimensional assumptions, this this kind of naive, symmetrical, random look which Hastings Armisen is actually rapidly mixing. So the complexity of this random walk algorithm only grows polynomial with only.

29:41

And the author of The Mixing Time Bomb, well, and this mixing compound can be interpreted as the complexity of this random, which is always is this mixing compound is is roughly given by this expression I see roughly here, because I have only I have admittedly some high parameters which are which are not really very important. So it is P an S zero square times and the proof of the mixing compound relies on the canonical pass method.

30:17

And it's kind of the most widely used method for studying the convergence of multiple finite state spaces. And then and the main idea of the approved well is actually just based on that strong selection consistency. So we verify the strong selection, consistency for the purpose of action problem, and we prove that the distribution with high probability will be using the model.

30:42

And then we and then you can use this method. And it showed that while the repetition of the random walk algorithm is actually mixing. But, of course, given that the random walk hours and already performed well, pretty well, that we may ask, how fast can the mixing time of an informed and serious and for verbal seduction be? And we and of course, we want to prove that the missing time of uninformed altruism is much faster than mixing time earth of the random walk hours.

31:19

Otherwise, theoretically, we cannot really say why we should prefer a local law enforcement proposal scheme. So the of the main challenges for this analysis. Well, so I guess there are two main challenges for for for the for the analysis of the reasons for this verbal seduction problem. First of all, although we can say that we are under some high dimensional assumptions, we can prove the poster distribution, the model, and actually gets you in the majority.

31:55

Or they imply that given any say, if we are at a model, which is not the true model, then we can always move to a better model by adding one coverage or removing one COGAT. Here, let me use this star to denote the truth that I know the true model. And I would refer to the cowbirds in Afghanistan as influential cowbirds.

32:22

And the other programmes will be just quoting now, influential. So the idea is that, well, if the current model is missing some influential COGAT, well, then maybe we can use some additional move to increase the positive probability. And if we end, if we have already all the infrastructure upgrades and we have also included some non-commercial commerce that maybe we can remove some Nightwish programmes and then we can arrive at a better model.

32:52

But the problem here is that because of the dependency structure amongst these these Marable's, so actually the post-war landscape of the landscape can still be pretty irregular in the sense that well beyond this, we cannot really see much beyond this other modality.

33:13

For example, although we may, although intuitively, intuitively we may conjecture that if the sample size is really large, then perhaps the poster probably will always increase if we add an influential covariate and the perception of it will increase if we remove a large influential. But actually, in general, these claims are not right. So in general, we cannot say that's adding any knowledge, adding any influential covariate.

33:44

They can increase their support. We also we can say that, well, they are always with high priority. They are always exists when such good means. But we cannot make claims like this. We cannot say any any addition of any financial covariate will increase the percent probability. And this is just because the variables can be correlated.

34:09

And perhaps it is helpful to recall the previous example. The point the point here is that, well, although the poster disputants you the model, but it doesn't mean that the posterior probability will increase whenever we move in a direction pointing to the morgue. OK, so for example, in this example, if we move from one one to one to the actually the probability decreases a lot. So actually here we cannot move in the direction pointing to the mode.

34:40

And that's exactly I mean, this intuition applies applies to our natural selection problem as well. And another challenge here is that we have to choose this locally and from the proposal scheme carefully, because if we choose a category there, very often we will end up with will end up end up with an algorithm that can get stuck at local nodes. So. So let's consider this very naive way of implementing any form of the proposal.

35:18

So given the neighbourhood and which is the collection of all the states that I can move to buy when additions or deletions work, move, let's propose a new state with property proportional to the post area. This is kind of the most naive way, the most straightforward way of implementing a law calling for the proposal scheme. But actually, it's not going to work. It's going to fail completely. So we can consider a very simple scenario. Let's take a start. It's just has two variables, one and two.

35:49

And let's say the current model is the empty set. Of course, then we care about what's the transition probability from moving empty said to to to to the model.

36:00

We just convert one. Or maybe the model is just a collaborative to and then we can calculate we can actually calculated that the transition probably from the empty side to to the model with only cover, the one is bounded by this ratio, by the ratio of the percent probability of of of the single the model to the positive probability of a true model.

36:25

And now if the sample size is sufficiently large that, of course, the true mother will have the positive probability of the true mother will goes to one. So this ratio will this ratio will go to zero. And this will imply that although at all, although at the empty model, we will probably say keep proposing adding covariate one or maybe adding Coverer two, but actually these proposals will almost always get rejected. And in effect, we will just stay at this empty set for for a lot of iterations.

37:03

And the change is not really moving. So this is not going to be a good algorithm. And actually, if we think about this general definition of the local local law enforcement proposal scheme, so we will see that the main reason here is that the main difficulty here is that we actually cannot really say much about about this.

37:25

Normalising constant Zicam. And this is and this, again, is because, well, we are considering, say, a discreet space problem and that the landscape can can be very irregular or more precisely speaking, we can say that the political landscape can change drastically if we just move from the current state to a neighbouring state. Is it's totally different from from from a continuous state of space where we have enough continuity.

37:58

And then we can say that if we move locally around the country, the point the political landscape should have the change mantra. But if a model structural problems for private institutions defined on the screen state of spaces where they can be pretty irregular and it is normalising, constant can change drastically if we just move, say, from the current model to a neighbouring model. And then this creates a lot of difficulty in designing efficient, locally informed proposal schemes.

38:29

So we tried a lot of different ways to to design this local reform proposals. And eventually we found that a single solution was just to use bounded proposal weights. So now I'm going to introduce our algorithm, which we call it locally. Well, it is always in the way of locally informed and the of the proposals and the idea of our algorithm is pretty simple.

38:55

So first of all, let's partition the neighbourhood of Kharma into three subsets, the addition neighbourhood to neighbourhood and swap neighbourhood. And then let's perform the propose a weighting within each neighbourhood separately. And here I'm using this function W two to note that the proposal weight that I assigned to a neighbouring state. And and therefore, every neighbouring state government crime in the neighbourhood. I'm just going to use this bounded proposal weights, so I will wait.

39:29

This proposal using the racial pytka over Obopay, Gummo truncal at both sides. And by properly choosing these truncation thresholds, we will obtain a locally informed algorithm that is very, very efficient, as we will see shortly. And the main result of our paper is that we can actually prove under under essentially the same assumptions used by that paper of young and old, the paper on the random hastings' for the for the verbal seduction problems.

40:08

The mixing time of our algorithm is bounded by sea and with the is just a universal constant. So the point here is that this mixing rate does not depend on the parameter P. And we call this mixing rate dimension free mixing because it doesn't depend on. And the point here is that in high school settings is much larger than so. We care much more about the people care about than the sample size of.

40:42

And if we compare this mixing rate of our algorithm with the mixing rate, with the mixing bomb of that of that random algorithm in the paper by you at all, then we'll see that even after we take into account the taking into account the fact that we need to locally evaluate the political landscape in every iteration. The total complexity of our little algorithm is still smaller than the mixing time bound provided by young adults paper.

41:21

And if you check our paper, you will see that's actually the mixing compound for our algorithm, the rounding our paper is is even slightly smaller than this. But anyway, the difference is not it's not so important end. And I don't want to introduce those heavy notations so legit. So that's why I just presented this simple version of our result. The mixing compound can be targeted by a multiple of which is a pretty well, which arguably is kind of a surprising result.

41:56

And let's before I talk about proof, let's just let's do some say simulation to check whether our algorithms work well in practise. So we first consider the simulation settings of those paper. It's a pretty simple setting where we always say sample the truth that a star such that the size of gamma stock is equal to 10. So they are always just 10 in financial coverers. And we always initialise our algorithms with a random generator, that model with 10 covariates.

42:30

And then when the signal to noise ratio is is sufficiently large, well, then we expect that both random Hastings and a lit a match will be able to identify the posterior mode in a reasonable amount of time. And we observed that actually lit a match is much, much faster than the random walk and match.

42:51

So for uncowed will solve the peak of the five thousand and the four independent design matrix randomly match takes approximately 15 seconds to find the mode and the data only takes about zero point of one second. And if we consider correlated, it is our matrix, then the advantage of pretty much well is is is slightly more substantial actually. But when the true model but if but if we can, but if the true model is the is the model, then we will observe that the landscape tends to be flat.

43:25

And then in that case, random walk. Michala Hastings also performs pretty well and actually it is actually better than that. How much. And and I want to make a point here on the commercial aspects of their much that is well recorded in every iteration of how much we need to evaluate the landscape in the neighbourhood. Right. And this and this actually makes the so-called black walnut black the estimation of Batur.

43:57

Easy to hear this. Blackwall This globalisation means that we can estimate Beda using conditioning. So we look at every inch of data separately. For example, let's look at the first of the paper and less condition, all the P minus one coordinates of the current of the current model gamma condition or that condition on well, whether whether we have selected the other P minus one covariance in our current model.

44:32

And let's see how the posterior probability changes if we include this variable or if we remove this variable. And by doing these conditioning calculations, we will be able to obtain a so-called appropriate the estimate of BEDA, which has a much smaller variance than 10 debate at that.

44:52

We got we just the sample from the fundamental hastings' algorithm and that if we look at say that the performance of this low the of data will see that, then actually it's it's just way much better then than the patents that we sample from the wrong from the random walk out of six hours.

45:16

But I don't want to emphasise too much on the advantages of this validation scheme because it is probably a little bit controversial in the sense that, well, this is we do expect that in general settings, Rob Black on black retardation is always going to help us in terms of the estimation of data. For example, if the Post is really multimodal, then perhaps the problem is not going to help us that much harder.

45:46

The point I want to make here is that when we implement, say, a local form, the proposal schemes, when we do all those local evaluation of the landscape in the neighbourhood, these calculations, although they may look expensive, but they can be made use of in multiple different ways, we can use them to to our proposal probabilities and we can use them to calculate, for example, the estimates of.

46:15

And and, of course, we may say that, well, in reality, we would never have say we'd probably never encountered a union model distribution, we have a distribution that is somewhat multimodal. And then let's consider how our aggressive works for more realistic settings. So for the second assimilation setting, I'm just going to sample this.

46:38

So I'm going to consider a much more complicated, dependent structure for our is not matrix and I'm going to sample 100 influential cagers randomly and that they effect size. I sample from normal distributions. So some so some influential programmes. They may have larger effects, some influential converts, they may have smaller effects. And because we have 100 influential cowbirds and a design matrix has a very complicated, dependent structure.

47:08

So we expect that the posterior landscape will be multimodal. And now for this simulation setting, we still observe that amateurism performs much better than random things and that we can measure the performance of the two algorithms using the effective sample size, well, dividing the body, but by the runtime. And then we see that the effect is in fact a sample size of little eMESH algorithm is always much larger than the sample size of the random book,

47:43

which is ours. And this says that, well, actually our little algorithm is also pretty efficient in terms of exploring, say, a realistic multimodal posterior distribution for the probable selection problem.

47:58

And finally, we also consider the real data analysis which try to give us datasets and analyse approximately 5000 of its peers around the three hundred thousand and the for this very large dataset, we still observe that it performs much better than the random look and the effect of sample size of random book. And it's just about two per minute. But little match has a sample size approximately. So it is three per minute.

48:32

OK, finally, let me just briefly discuss say how we prove our how we prove how we theoretically study the mixing time of our little amateur hours. You have about a minute left for the proof. Thank you very much. Yeah. So we'll use a conditional approach. OK, and and so so this is what I mean by driv the condition. So I guess I need to skip the technical details here. And and I want to say that so recall that previously I was saying that for the random people.

49:15

True. That the mixing rate using the passive message. Right. But and I said that pass measures are kind of the most wanted to use the solutions for Malcolm, change our final spaces. But here we take a somewhat unusual and dreft the conditional approach I see it is somewhat unusual because this condition has been very used in the mixing time analysis of this algorithms, discrete status spaces. It has been drift conditioned approaches have been widely used for groups samples on a continuous basis.

49:51

But let's use just let's try a conditional approach for for our algorithm. And and another kind of unusual feature of our approach is that we actually proved the two different conditions for our algorithm on a state of space. And we would then try to combine these two conditions and then find a missing time bound for our data image outwards. And to explain this to stage of the condition, I want to first partition our state of space.

50:21

So I'm going to classify all the models into two groups, wainscot and the feature models. And the other is called Overfitting. The models, of course, graphical models just refer to the models that have already included all the influential covariates.

50:37

And now we are going to prove one of the conditions for all the overfitting amount of sarin, yet when when you have a condition for all the over models and then we'll prove another condition for all the undertaker models, and then we'll try to combine these two conditions and show that the change is going to be mixing quickly. So this picture briefly illustrates well, the idea of our cruise.

51:02

This is our modern space and the most of the models are undefeated, which are represented by this grey area. And a small subset in this modern space are overtaking the models which are represented by this blue circle. And now what we propose is to stage a condition. Of course, it is based on some theoretical insights into the biggest problem. There exists a lot of high dimensional results for stepwise procedures, for natural selection.

51:31

And those results tell us that if the current model is under fitted, then we can always add some colour to the current model so that we will be able to explain the variation in our response back to Y and the ones we have included all the influential colours and the model becomes overfitting. Then we will be able to remove those nonconfidential colours because then they become kind of useless.

52:02

So this intuition tells us that if we consider the behaviour of atavism and actually also the behaviour of the random look algorithm for the selection problem, then if we start from an unofficial model, then the chain should first say kind of drifts towards the set of all the overfitting models so that we can explain the variation in what. And once the chain arrives at this set of overfitting models, then we will drift towards the true model.

52:37

So this is the the main intuition behind our two stage condition when the commission shows that the train moves quickly from, say, any unofficial model to an overview of the model. And another condition shows that given once we arrive at an overflowing model, then we will quickly move towards the true model.

52:59

I'm going to skip these technical details and the all the paper, you will find some general results for how to efficiently combining to drive the conditions on a state of space and the are paper. We actually derived these results for general status basis. So these results, I mean, they are not limited to two discrete state of spaces.

53:22

And actually, you can also generalise our again, our argument to the case of more than two of the conditions we consider stringent conditions for traffic conditions, et cetera. As long as the state space has such a NASA structure, then you can use our argument to combine multiple different conditions and then obtain a mixing time bomb. So I think I will maybe use another two minutes to briefly discuss some generalisation of the results that I have discovered so far.

53:52

So so this live amateur hour with them and and and all the intuition that I have been talking about, of course, can be applied to other models, selection process. I mean, other than the selection problem and the actually for any general discrete state spaces, we can we can prove some we can prove some useful result for these locally informed and C schemes.

54:23

We can actually prove that for any discrete spaces, as long as the possible distribution satisfies a universal condition, that we will be able to construct a locally informed MSFC algorithm such that its convergence rate can be bounded by a universal constant that doesn't depend on the number of variables.

54:48

So the dimension of remixing can always be achieved for other models of action problems, not only not only the variable selection and that we can always devise such locally informed proposal schemes that can achieve these dimension Friedrichsen rates.

55:06

And then the critical step in all these in the devising of these locally informed proposals and the theoretical analysis of these locally informed MSFC is to act is always to verify this, know the condition you want to prove that what the poster distribution is, you the model, and then you would be able to and then you would be able to use some arguments to establish that that dimension free mixing weight so limited, so limited by time. I guess I'm just going to say quickly wrap up my talk.

55:40

OK, so, so, so today I have been talking about it. Is this this little algorithm, which is kind of a simple but we want to say a pretty efficient solution to the variable selection problem, and it can attain a provable dimension, free mix and. Right. And I and I want to say that for their local informed addresses, the proposal schemes, that the implementation of the proposed scheme can always be easily paralysed.

56:10

So actually, in reality, it's not going to take it can be it can be implemented in a much faster way than than even though simulation results that we have that I have shown and because of the simplicity of our algorithm, you can also combine our proposal schemes with other more advanced techniques such as blocking, tampering, lifting, etc. But whether these whether these advanced techniques are going to further improve the mixing rate or

56:38

the total or whether these skills are going to further reduce the total complexity of the algorithm. Well, I guess it's pretty hard to say and that we need to do a simulation to to to to verify. And this methodology can be generalised to other models of action, as well as we can verify a so-called unmowed. And I use the model condition and the limited by time.

57:05

I cannot talk about the other problems, but in another of my work trying to work with my student, we have verified this during the model condition for this structural problem. So so far, the only problem for the graphical model problem, you can also say device such locally informed and same measures which will achieve a dimension for mixing weight under some high dimensional assumptions. So that's kind of all I want to talk about for you. That's all I want to talk about that.

57:37

That's all I want to talk about today. Thank you much for your attention. I hope we still have time for questions. Thank you for for the great talk. So we still have one minute, perhaps one or two quick questions from the audience. Jeff. OK, so any any questions from the audience? I was actually clapping, not putting my hand up, but I did have it. Thank you. That was lovely talking. Very nicely explained. So I did have a question at first sight.

58:08

It looks like I need to calculate the marginal likelihoods to calculate those parts. Right. So I'm thinking, do you have any suggestions on is there like a is there like a randomised version of this or something like is it enough to have unbiased estimates or something? So you are people we would consider kind of simplicity where the marginal likelihoods can be, can be, can be evaluated across the board. So I assume they are available.

58:38

Essentially, this means that we are not considering a hierarchical financial model to assume that all the parameters are given. So we can always evaluate this pretty easily.

58:51

But if you want. But if, if, if, if we have, say, a no hyper parameters and the likelihood is are not tractable, then maybe you can you say maybe those super marginal algorithms and then try to combine status through the model MSNBC with our local law enforcement proposals, and then you will still be able to obtain similar results. I guess the main idea is that essentially I'm in this talk, I was just focussing on how to make the proposals.

59:22

So I appreciate I think that what you're doing is very useful. And if I may join a second question would be so that algorithm you wrote down with that truncation, you truncated the top and bottom weights. I mean, that kind of came out of nowhere. What what can you do better? Is that like do you have a feeling that's the best you can do or or what's the intuition? So it's a great question.

59:59

I forgot to expand the intuition. So so so I guess the main difficulty in designing these local informal proposals is that although it's pretty easy to find a proposal that's with high probability, it's going to propose those states with larger parties, the difficulty is that how can we make sure those proposals will get accepted?

01:00:21

So that's the difficulty. And the to make the proposals, get to make the proposals, get accepted it, then you want to you want to kind of make sure that you also want to take into account the proposal probability of the of the process moving back. Right. That that's that proposal probably cannot be too small. So that's why eventually we we found this truncation scheme, because if we use this truncation scheme, then the proposal probability of any state can be found in the proposal.

01:00:54

Probability of any state can be bounded. So theoretically, this is a great advantage because it makes it makes the proof possible. And yes, yes, yes, yes, yes. That's kind of the main issue. I hope it is helpful. Yeah. OK, so any other quick question from the audience? OK, I think we can stop today, so thank you, everyone, for coming. Also trying again for the sweet talk. So this is a second Laza seminar this term. So we'll see you next Friday for the last seminar this term.

01:01:37

So have a great weekend, everyone. Thank you much. Stronger much for the invitation. Thank you very much for the great questions. And thank you for your attention. Let's check the speaker again.

Transcript source: Provided by creator in RSS feed: download file

Episode description

Transcript