OK, so welcome everybody to this first virtual edition of the CSIRO talks for this term of a very happy to have Radford Neal, who was my advisor at you of T presenting a remotely, and he's going to talk about some of the recent CMC work. He has been doing so again. Thank you, Rutherford four, for being able to join us virtually, and I'll hand over to you.
I'd like to talk about a new idea for doing IMC, in particular, acceptance of proposals for the agency and in particular its application to larger than methods for which it seems to be of most potential use. And how this new AMC AMC method might help with sampling from the posterior distribution and hierarchical Bayesian models. So. Sam, huh? OK. See how the Iraqis are OK here.
OK, now they're raking. So I'll start off with some of the motivation here, assuming you know a bit about AMC, AMC, but if you don't, I'll I'll do a quick review of some CMC after that and then get to the the new AMC methods. And then briefly discuss how they actually get applied to the things that are the motivating models that I discussed at the beginning.
So hierarchical Bayesian models are are arise because for realistic problems, it's usually the case that you don't want to try to specify the prior directly in terms of the actual model parameters, but instead it's more convenient to specified hierarchically with higher level hyper parameters in addition to the lower level parameters that actually control the distribution of the data.
So, for example, for a simple, multivariate regression problem with K covariance, we'll have a model for the response in terms of the covariates, the covariate vector XY here, and we can phrase that as a dot product with regression coefficients and an intercept, and we have a standard deviation of the residuals sigma. Those are those are model parameters, and we could try to give a direct directly give priors to alpha betas and and sigma.
But it's more convenient when we try to try to express realistic prior information that we instead say that the beta is given a higher level hyper parameter. Tao are independent with normal distributions with both variance tao squared and that Tao itself is given a a prior distribution.
The reason this is is better is that if we tried to do a direct prior in terms of saying that Beta JS are all independent with normal distribution mean zero and standard deviation seven, we in practise usually don't actually know that. We don't know that the standard deviation of the beta should be seven. It could be. Before we look at the data that we it, it could be that maybe the covariates aren't actually very useful for predicting Y,
so all the beta should be close to zero. But probably we hope that's not the case if we're doing this at all. So we think maybe the beta should be bigger, but we're not sure how big we don't know how predictable Y is. And so we can't plug in any specific standard deviation for the betas. Instead, we express this in terms of of the betas having a standard deviation TAO and then we give some prior distribution to tile.
Of course, we could integrate over tau and come up with with some direct distribution for the betas. But in in complex hierarchical models, doing that integration might be difficult. In any case, produces a direct prior. That's very hard to understand. And so it's it's preferable to do things this way.
Right now, that's a very simple, hierarchical model for more complex models, we might have more than one group of related parameters, each of those groups having a a unknown variance that we then give a prior to is a higher level hyper parameter. So in neural networks models in particular, we we have perhaps a number of different hidden layers that connect to the inputs and and to the outputs.
And so there's groups of weights that are, say, connecting the inputs to this hidden layer or this hidden layer to that hidden layer. And each of those groups might be given some, some prior distribution, whose variance depends on a higher level hyper parameter. And these higher, higher level hyper parameters can represent a higher level information about what the problem is like.
Such as which covariates are actually relevant or whether the the model should be an additive model or one that allows for interactions and things like that. You also get what is mathematically the same setup when you have models with latent variables or random effects, whatever you want to call them, because in a Bayesian context, this is is is basically the same sort of thing is the hierarchical priors for what one might instead call parameters rather than latent variables.
So here, for instance, is a one way random effects model in which we have some random effects in UI, whose distribution is controlled by a higher level parameter TAO, which in turn is given some prior distribution. Now, if we're doing a Bayesian inference for this, we we usually end up doing more CMC and that's OK for simple models. In fact, Gibbs sampling works OK for such simple hierarchical models.
That was these were amongst the examples early on in the statisticians use of CMC with with the gel found in Smith's paper, for instance. For more complex models, though, we may have to do something more general because we don't know how to do good sampling or it's very slow.
We might use some general random walk metropolis method or Hamiltonian Monte Carlo, but there's the problem with this if the posterior distribution for for variance hyper parameter is broad, we don't have so much data that it gets pinned down exactly what it should be. Then the Markov chain we use for sampling from the posterior distribution needs to sample both regions where there are tight constraints on the local parameters.
The beta jay, for instance, in that simple regression model, and we'll have tight constraints on them when Tao is small, because then the prior will constrain them to be close to zero. But if the distribution for Tao is broad, then we'll also have have high probability regions where towers law is large and the beta genes are much less constrained.
Now, when we try to do AMC, AMC for this with a random walk metropolis or with HMRC, the the tuning parameters for these methods are that our best are different in these different regions. Either the proposal with four random walk metropolis or the step size four for HMRC, and if we then use a proposal that's only suitable for the less constrained region, we can end up with a with a Markov chain that almost never visits the more constrained region.
And on the other hand, if we tuned for the more constrained region, it might be extremely slow in the less constrained region. So you can see the the the sometimes horrible consequences of this in a slow set sampling paper where there's a final example of a final distribution that say a model of this sort of situation. So there would be various possible solutions to this that we might try.
One is that we could just say, well, we need different step sizes or proposal widths in different regions so we can just randomly pick a step size each step. And from a distribution broad enough that we sometimes get ones appropriate for each region. The problem with this is that we then end up wasting a lot of time on unsuitable values for the for the tuning parameters. We get them once in a while for the parameters.
The tuning parameters that are suitable for the region. We're in it at this time, but we may waste lots of time. Now there might be a way to get around this. That's discussed in in one of my papers on what I call the shortcut method that tries to cut short the computation time you need for the unsuitable values. But that's not the topic of this talk.
We might decide that instead, we should reprioritize the model, saying, for instance, that the Beta J regression coefficients are equal to Tiao Times. Zayda J. And Zayda J is just given a normal zero one distribution so that we end up with better j having a standard deviation TAO. And now the prior constraint on Z to J doesn't vary. So we don't have the problem with that. When Tao is small, Zaidi's is is more tightly confined.
However, we have the opposite problem that now, depending on what tower he is, the the way that the data constrains Zeta J is now going to vary, which might cause similar sorts of problems. So another possibility is that we just alternate iterate iterations that only update for low level parameters like the beta JS with it with iterations that update only the high level hyper parameters like the TAU.
If we do that, then when we do the low level parameter updates, it's it's valid to change the proposal with the HMRC step size based on the current values of the higher level hyper parameters because they're not changing during the updates for doing so. That doesn't undermine the correctness of the of the methods.
Of course, you have to have some way of doing that, some scheme for how you change the the lower level parameters based on the current values at the higher level ones, but that sometimes may be doable. Also, the higher level updates may be pretty simple. Perhaps a disadvantage, though, is that if the lower level updates are lengthy. So, for instance, HMRC spends a lot of computation time and computing a long trajectory typically.
And that means then that we do the higher level updates only infrequently as far as the as the amount of movement at the lower level that that complex update does and that that may be disadvantageous and that we don't get his frequent interaction between the low level and high level parameters, which is necessary in order to fully explore the posterior distribution.
So that's that problem is one motivation for this work. Another motivation is handling discrete variables, which arise in many statistical models like mixture models that have component indicators where which component each data item came from, or if you're handling missing categorical data and so forth.
Now, these discrete variables aren't a particular problem. If you're doing Gibbs sampling or random walk metropolis, you just have a proposal distribution that also proposes new values for these discrete parameters, or that samples from their conditional distribution. But they can't be directly handled in a gradient based methods such as aid simply because they're discrete and don't have derivatives.
So one approach would be again to alternate AMC updates for the continuous variables with other updates that are only used for the discrete variables.
But again, we have the issue that if if the AMC updates are are complex updates doing lots of computation time, then we're spending we're doing a lot of work at the low level before we at the for the continuous variables in this case, before we change the discrete variables and assuming there's dependencies between these that are doing a lot of work on discrete variables, on continuous variables without changing the discrete ones is maybe not too useful because
you really have to look at changing both sets of variables in order to move around the space. So the the question is, can we find some method with the advantages of each AMC that avoids this problem? So before getting to to that question, I'll do a bit of background on HMRC. So this is going to be pretty quick if you've never heard of it CMC before, but just to orient people here, the CMC approach is that, first of all, we specify our model, including both the likelihood function and the prior.
And we assume these are computable up to some unknown normalising constant. So the posterior density is also computable up to a non normalising constant. We then define a Markov chain that will converge to this posterior distribution from any starting point. And we run this chain, perhaps from many starting points for long enough for it to approximately converge and then use the points that we get after that to estimate expectations with respect to the posterior.
For instance, posterior means the parameters or predictions for new data points that involve an integration over the posterior distribution for parameters. Now, the basis for doing this is that the Markov chain has to converge to our desired distribution, and for that to be true, it's necessary that if the desired distribution and variant.
So if we want to sample from a distribution with density pie, we need need to have it that the integral of of pie of X times, the transition probability for the Markov chain to go from X to X prime is equal to the two pie of X primes. So that basically says if ever we actually got to the right distribution, then after we did another transition, we'd still be in the right distribution, which is a requirement if we're going to ever converge to the right distribution and stay there.
This isn't a sufficient condition because we have to also make sure we don't get trapped in some subset of the state space. It's typical for for many CMC methods, that invariance is proved by first proving reversibility of the transitions.
So reversibility for four transitions of a Markov chain is that the probability of starting in a state X if you started according to the distribution pie and then moving to X prime from X is the same as the probability of starting in X prime if you start from pie and then moving from x prime to X. So basically, this says that once you reach convergence, the transitions from x two x prime happen as often as those from x prime to X.
And if this is true, then invariance will also be true, which is easily shown by just integrating both sides. So this is one way of proving in variance, but it's not the it's not a necessary condition for it for in variance. There are non reversible Markov chains for which this isn't true that nevertheless leave the distribution pie invariant and these these non reversible methods can often be very useful.
Now, the the oldest NCMEC method is what's called the metropolis algorithm and is a very general way of defining a transition that's reversible for PI, involving first proposing a State X star based on the current state. We're in X and then either accepting or rejecting that proposal based on the ratio of densities for X star and the current state. If we reject the same, the new state is the same as the Old State. Otherwise, the new state is the proposed the star are.
The proposals are done according to some distribution that all writers as the X star give an X and this has to be symmetrical. And the acceptance probability for this proposal is the minimum of one and the ratio of Pi X start overpay x. Now it's it's straightforward to show that this transition is reversible and therefore it leaves PI invariant. So that's. And since this, all we need is this ratio, we don't need the normalising constant for PI.
And so this is a very widely applicable method, both for Bayesian inference and for statistical physics, for that matter. So here's an illustration of the metropolis method applied to a by variant Gaussian distribution with high correlation point nine. And with the proposal distribution being a a normal distribution with mean equal to the current state and with some standard deviation here.
Point three. Uh, with the with the proposals being independent for the two coordinates, you do much better if you make proposals that actually obey the the high correlation. In this example, this is an example that's meant to illustrate a more, more practical problems in which there may well be these high correlations, but we don't know what they are to start with, so we assume we can't carefully tune our proposal distribution to to do best with that particular correlation.
So the green points here are an aid sample from pie, just so you can see what the distribution is like. And the black points show 250 transitions, the Markov chain starting from from over here. And we see that initially the starting point is very low probability under this distribution, and we see that we we progress reasonably systematically towards a high probability region. But after we get there, we move around by a random walk, which is rather inefficient way of exploring this distribution.
The red lines show the rejected proposals that where we stayed at the current point rather than moving. So here's a extension of this example, where we actually have 20 copies of this by various normal distribution that are the copies being independent of each other, and we'll again assume that we don't know anything about this distributions or our proposals are just from this spherical Gaussian with with now a standard deviation point 07.
And we now see that things are much, much slower because we had to use a standard deviation of 0.07 rather than 0.3, because we're proposing changes to all all 20 dimensions at the same time. And to get a good acceptance rate changing all those variables at once, we have to do a smaller change. And so we see we now take smaller steps to get towards the distribution, and we still proceed systematically pretty much towards the high probability regions.
But once we get here, we then are again doing a random walk. But now with much smaller step size. So this now actually explores the distribution 18 times slower than we had within when we had only one copy of the bi variant normal. So this is an illustration of of the slowness of exploring things with random walks, because the random walks have no tendency to keep going in the same direction.
And as a simple example, suppose we did a random walk in which X plus one was just x t plus from plus some normal draw from from a standard normal distribution independently for each time. Then you can easily sort of add up the variance of these entities and see that X T plus K is likely to be only about square root of K away from x t. The variance of the difference is is is k from adding up those k independent normals.
But that means the standard deviation is on the root K and we're only likely to be about that far away. So we get enormously faster distribution, enormously faster exploration if we can somehow not do this random walk or. Well, there's actually two approaches here. We could try to make it so that the changes we make our big rather than small are if they're so big that they sort of move you from one side of the distribution to the other,
then we don't care about the random walk anymore. Or if we have to still do make small changes, we want to have them not do a random walk. And that will imply that the transitions have to be non reversible because if the transitions are reversible, the chance of going from X T to plus one has to be the same as the chance of going the other direction. So it's inevitable that there's a random walk aspect to things.
Now, just a technical point here is if we have several Markov chain transitions like care them to, you want to take all of which leave the distribution and variant, then the operation of just applying them in sequence will also leave PI invariant. Hopefully, that's fairly intuitive that if each of these that invariant, the combination does. But it's not the case that if they're all reversible, then the combination will also be reversible.
That's not generally true, but it's OK that they're not reversible. That's not a requirement for for CMC. A simple example is is the frequently used Gibbs sampling procedure where we just where we have a multivariate state and we update each component of it in turn by sampling from its conditional distribution. And because we're if we're doing it systematically over the key components, then then that's a a sequential application of K transitions.
And although it leaves the right distribution invariant, it's not reversible. This non reversible ability for Gibbs sampling seems to have no particularly amazing consequences. But in other situations, arranging for your transitions to be non reversible can be extremely advantageous. So I'll now get on to Hamiltonian Monte-Carlo, which is a very popular method of sampling from posterior distributions now used, for instance, in the stand package.
The reason it's popular is it's much more efficient than than a simple metropolis methods where, for instance, you just propose a new stage by just adding random noise to the current state. The way HMRC works is we augment the variable of interest that I'll call X with a momentum variable of the same dimensionality that has a Gaussian distribution independent of X,
and the transition then consists of two parts. We first sample a value for the momentum from its distribution, which might be something very simple, like like a normal with with identity covariance matrix. And then we do a metropolis update in which the proposal is found by simulating Hamiltonian dynamics. That takes us from our current point x p to some proposed X star p star and then we accept or reject it in the usual metropolis fashion.
So this is also a combination of two steps, each of which leaves the distribution invariant and in fact, each of them is reversible. And so the the sequential application of the two also leaves the just the distribution invariant. Now, if the Hamiltonian dynamic simulation were exact, we'd always accept because that's that would be that's a consequence of properties that Hamiltonian dynamics.
But in practise, we can't do the simulation exactly. We instead approximate the dynamics by doing what are called leapfrog steps, each of which sort of moves us forward in in in simulated time here by by an amount ADA. And so we managed to go for a total time Tao by doing a L steps, each of which go for a for a time ADA. So this is a descriptors discourteous ation of the of the dynamics that isn't exact.
And because of that, rejection is possible with the rejection probability being higher if we take bigger steps. So here is the BI variant Gaussian again, we can see that now we were doing a 10 leapfrog steps in the simulation with a step size of 0.1 six, and that gets us directly from our starting point to a point down here. And then we move around in big steps after that with occasional rejections that you can see with the red lines.
But when we don't reject, we move quite a quite a distance along the distribution. And therefore, although HMRC is actually reversible, it doesn't particularly matter that it's reversible because it's taking big steps. And so it doesn't matter that they are then equally likely to go the other direction because we don't have to put together a lot of small steps in order to get any appreciable distance.
So here are some trajectories from HMRC. They start off with the red trajectory here, which was accepted and then so on. For other colours, this blue one was rejected. And so we went back here for the next one. The the dynamics confines these trajectories to the high probability region and also keeps them going in the same direction until it's necessary to turn around when you get to the end of the distribution.
So that's that that momentum keeping you going in the same direction is why there isn't a random walk within a trajectory. And here we have HMRC for the replicated by variant Gaussian with 10 copies of the buy variants Gaussian. We have to reduce the step size in order to get a reasonable rejection rate in higher dimensions. And so there is some sort of penalty, which dimensionality? But the penalty is much less than with the metropolis methods.
The penalty basically grows as the five fourth five fourth power of the dimensionality, whereas it grows linearly with the dimensionality for simple metropolis methods. And now I'll get two lines of in Monte Carlo, which is the method that that I'll actually be improving here. So Landgraf in Monte Carlo, although it wasn't originally thought of this way, is actually equivalent to doing each AMC with a single leapfrog step.
So it's just a special case. And. Doing a special case in detail here, we once again sample a a momentum variable. And then we compute a proposal with a single leapfrog step, which I have written out this way involving the gradient of the log of the probability density.
And then we accept or reject that that proposal and we can actually write out this proposal x star sort of going through what what actually happened here as the current state plus a a movement in the direction of the gradient, plus the noise, which is a another way that lines of in methods are thought of. And here's Lindsey Vonn for a bi variant Gaussian. Are we are we sort of overshoots the first step overshoots to the opposite side of the distribution and also a low probability region,
but it soon manages to get down into the distribution. Once it's there, it again does a random walk just like random microdroplets. There is actually an advantage to a range of in over random walk metropolis from using the gradient information, but it still suffers from this random walk in efficiency. And here we have it for the replicated by variant goes in and we have to reduce the step size to get a reasonable acceptance rate. And so things get worse with dimensionality.
This scaling with dimensionality is is better than for Simple Metropolis Random Walk Metropolis updates, but it's worse than for HMRC. It's scales, I think by by the dimensionality of the power for thirds, which is not as good as the mentioned outside the power five fourths, which is what HMRC is like. OK, so that's the background, culminating in the laser beam method, and I'll discuss how one can improve the larger than method by making it non reversible.
So the there are two non reversible improvements here, actually. This is made it hard for me to come up with a catchy, short description of of the new method. So the first non reversible improvement was made by Horowitz back in 1991, and it modifies the launch of an method by only partially changing the momentum variable each time. So we're for the original answer, then methods.
You can think of the state as consisting only of X because you re sample p independently of the previous P at the start of each update, so you don't really have to save it from the last one if you don't want to. For the Horowitz's modification, you do keep in the state all the time because you don't completely replace P, but instead only slightly modify it.
We change P to by an auto regressive update of making it equal to alpha p plus square root of one minus alpha squared times some noise and if alpha is only slightly less than one. This makes only a slight change to P. Then we knew proposed a new state by doing a leapfrog step as before, and we accept or reject this proposal the usual way.
A detail that didn't matter in the original Langeveldt method but does now is that the proposal not only does a leapfrog step from XP, but after doing that it also negates P that ensures that the proposal is symmetrical and that we'd go back where we came from if we were to do it again.
That is crucial to that to it. Leaving the right distribution and variant, which we proved by proving rigorous stability, which requires this negation for the original leapfrog method that doesn't you don't have to actually put that in your programme because you're going to forget anyway. But for this method, you do have to really negate P R, and then the next step is you just negate P. So there are three updates making up one of these overall persistent momentum lines of updates.
Each of these three updates leaves the correct distribution invariant, so they in sequence also leave the correct distribution variant. Each of them is actually also reversible, but the sequential combination is not reversible. And in particular, you can see what's going to happen here when you accept we re slightly change P. So P is is mostly pointing in the same direction it used to.
Then we propose a state by a leapfrog step, which will tend to move X in the direction p and suppose we accepted. Then we accepted this proposal in which we negated p r and then we negate P again. So well, we undid the negation here. So P is now still pointing in the in the direction it was pointing to at the start.
Here are, well, maybe depending on how much it changed in leapfrog in the leapfrog step, but often it would be still pointing in the same direct general direction as it had at the very beginning. However, if we do a rejection, then we negated. Then we didn't negate P here because we didn't accept that proposal where it was negated.
Instead, we had just the original P after step one and so that in step three, we negate P. And so now we're going in the opposite direction from where we were before, which is not very good because now we're going backwards and largely going over states. We we just came from that isn't very efficient, so we'd like to avoid that. But so we have to set the step size small enough that rejections are quite infrequent in order that we don't do that reversible reversal of direction.
So here's a here's what we get when we do that for the by variant Gaussian, we are. We move from a low probability point to the start towards the the high probability region, a bit slower than we did for the regular olanzapine method because we're only slightly changing that, changing the momentum now. So it takes a bit longer to dissipate the the energy we had from being at this high energy point in statistical physics terms.
But we do after a while and after that, we explore the distribution in small steps. But with the random walk mostly suppressed because we don't keep changing direction by by sampling a new P. Instead, we only slightly change each time and keep going in the same direction, assuming we don't have a rejection which results in this reversing. So you can see these successive states here are going in the same direction, mostly.
Eventually we turn around, of course, but where we're doing less of a random walk than we were with the simple change have been method. But to achieve that, we do have to make the step size be pretty small, so we don't do these rejections that reverse things. And here it is for the replicated by various Gaussian. We have to make the step size even smaller to keep the rejection rate low. But having done that, we can see that we do proceed in the same direction for quite a few iterations,
if you can sort of see these these dots in there. So. So we we have managed to buy this method to suppress the random walks of the regular Langeveldt method. But there's a cost to that in having to choose a really small step size to reduce the rejection rates to a very small value. So that's the that's sort of the slightly disappointing aspect of this, that having done all this, we see that we need a very small step size in order to get the rejection rates small.
So even though then we we avoid the random walks. It's still pretty slow compared to Hamiltonian Monte Carlo, which can use a larger ADA and which doesn't reverse and in the middle of its trajectories, except when it has to turn around because it reaches a low probability region. And so because of that Hamiltonian Monte Carlo with long trajectory lengths, a large number of leapfrog steps L tends to be better than than Horowitz's persistent momentum version of Langeveldt.
So now we come to the new innovation here, which is to say that well, as well as introducing Non Reversibility Bay, not completely replacing in each iteration will also introduce non reversibility into the acceptance decision, which which may seem strange.
But alas, so what I mean by that? So to decide I'm accepting a metropolis proposal to move from X to X star, we have to accept that with probability that's the minimum of one and the ratio PI x star over Pi X. An easy way to do that is you just generate a uniform zero one random variable you and you check whether you is less than that ratio. If the ratio is greater than one, of course it's less. If the ratio is less than one, then you are going to accept with probability equal to that ratio.
Now that's equivalent to saying that you'll check whether PI of X star is greater than S where X is a random variable. That's uniform on the interval from zero to pi of X. You can see that. So I just sort of moving pi of X to the other side of that inequality.
And now rather than choosing this as randomly, independently, each time we can make as part of the state and updated in any way that leaves the joint distribution of of X and S invariant, in which X is independent of X, and that in that just distribution.
Well, no, it's not independent. It's it's uniform over zero to pi of X, so it's not incentive X. Now it's more convenient to actually think in terms of U, that's x over pi of X and U is independent of X. So we can think of you as being a independent uniform zero one around a variable that's part of the state along with X. And then we can make our accept rejects decision in this way, using the you part of the state when we're looking at a proposal to change X.
So, so now we can do any update to you that that leaves this joint distribution and very end. And one possible update is that for some constant delta, we either add or subtract Delta Times five x two s, which is the same as adding or subtracting Delta to you. And well, we then reflect off the boundaries that zero in pie of X or zero and one effort if we're thinking in terms of you. So here's some implementation details as to how we managed to reflect our boundaries.
We can let SB pay of X times the absolute value of B, where the now has the uniform distribution from minus one plus one independent of X, and we update thee by adding delta and then subtracting two if the ends up greater than one. So this this this effectively implements reflecting off the boundaries from zero to y at zero and one for you. And now, if X changes to x prime, then we make corresponding change a V to the prime.
That's the the old V times the ratio of Pi X over PI x prime and that it keeps s unchanged. And so that's why this is a valid procedure is that we're not actually changing as in in this in this change to X. So that's a that's a bit of a quick explanation as to why this is a valid method, but I'll continue on. So now the question is why, why might this be of any use and the use is that by doing this, we can cluster together rejections.
This isn't going to change the rejection rate because still you has the same normal as the same uniform zero one distribution as it did before. And so on average, we we get the same rejection rate as before, but by by moving you slowly from zero to one and back again, we can end up having a rejections clustered together and acceptances clustered together.
So in particular, if the rejection rate isn't high as as would be the case for a large given method with a reasonable step size, then the ratio pi start over payouts is usually going to be close to one. So you mostly changes as a result of adding delta and reflecting off the boundaries.
So if if Delta is small, then then you is going to be near zero for a while as we just slowly change it by adding that delta, then it'll be near one for a while and then it'll reflect off one and and start decreasing and be near zero for a while again. So that has the effect of clustering the rejections because we're much more likely to reject when you is near one. Because remember, we we reject, we check whether we accept by seeing if you is less than this ratio.
So if you is close to zero where much more, that's much more likely to be true than if you is close to one. So the effect of this is that we'll tend to get a bunch of rejections when you is near one, but then get lots of acceptances when you is near zero.
And now the the benefit of this is that if the for doing the Horowitz's larger than procedure with the momentum only slightly changed each time by we get we avoid random walk behaviour as long as we don't have a rejection that reverses the momentum. Whereas when we do have a rejection, we we we reverse, we reverse momentum and and things don't work so well anymore.
So we want to see by clustering the acceptances together, we'll get more consecutive acceptances that keep moving in the same direction. And you can see how that is better by a simple sort of model here. Suppose they each accepted move moves the direction D in some direct moves, moves the distance in some direction, and each rejection randomised is the direction which isn't quite what happens with with that method.
But let's suppose that then if we do 20 times some number K iterations in which they take the form of four acceptances, one reject for acceptances, one reject and so forth. This all repeated K times. Then on average, the distance we move is going to be the square root of 4K because we are there are four K groups of four accepts and one reject. And so each of which moves in a random walk between those groups.
And so the the sort of distance that we move in in those four K groups of of moves is the distance moved in one, which is four times the square root of the number of those which works out is eight times the square root of K. But if instead we do these 20 K iterations in the form of 16 acceptances, followed by four rejections, which is the same acceptance probability as above, then we get each of these moves the distance 16 D and we have K of them doing a random walk.
And so we move 16 times the square root of K, which you can see is twice as much as that. So there's an advantage to clustering the rejections together, because then you also cluster the acceptances together, inevitably. So that's the the motivation for trying to do this non reversible scheme for doing acceptances involving just adding delta to you rather than sampling a independent value of you each time. So here's a illustration of this for the base variant Gaussian as before.
Here. Well, it's a bit you can sort of see that it doesn't do a random walk by it, by how it goes here before it actually has a rejection and has to reverse there. And while there's a few too many points to see things clearly here, but it's it's not it's not doing a random walk in there, and that's even though it's using a larger step size than for the previous illustration of the hoards Horowitz's version of the avalanche method.
And here we are for the replicated by variant goes in again, we can use a larger step size than we did before, and we still get a lot of suppression of random walks. As you can see here, for instance, it consistently goes in the same direction for many iterations. So we do need to tune things, of course. So we have to keep it a small enough that we don't have a have a really high rejection rate, but we don't have to keep the rejection rate as small as before. And we also have to tune Delta.
It seems that about one minus alpha is is right, although I don't have any any precise ideas about how Delta should be tuned. And when we do that, tuning properly compared to a properly tuned HMRC, we get about the same performance, it seems on this example. And if we look at the value of you, add each iteration here is the value of you, each iteration for the regular persistent lines of method that Horowitz came up with and the rejections are in red here.
So you is sampled independently each iteration. So we get the expected scatter here. And well, there are some of those are rejections. Here's what we get with the non reversible update of you at the beginning. We get a bunch of rejections because we started in a very low probability point, which causes problems at the beginning. So you can sort of ignore that and then see what happens after we've gotten to the high probability region.
And you can see we we have a lot of consecutive rejections when you is close to one, but then we have a big stretch of time here where there's no rejections when you is close to zero. So we've successfully clustered the rejections together, making them less damaging as far as as randomising the direction so we can manage to suppress random walks for a long time here and make better progress than if we have rejections mixed in there that keep reversing the momentum.
OK, so now I'll I have a few minutes to talk about how these could actually be, this method could actually be applied. Here's a test idea regarding discrete variables, so this is a simple distribution, which means we can actually tell what the right answer is. That's meant to be a model for more complex situations where we have a bunch of continuous variables that we want to update with HMRC. But we also have some discrete variables which we can't update with HMRC.
So we're going to update those with Gibbs sampling, and we alternate the updates of the discrete variables with Gibbs sampling with updates of continuous variables with HMRC. Now here there are only two continuous variables, but for this example that you can imagine that there is lots more and that the updates of the continuous variables are actually the dominant cost. And so I'm going to assess the the methods on that basis.
The the HMRC updates are the ones that are costly. And so if you do that and you do if you do the non reverse of the lounge, then not reversible, both in terms of Horowitz's modification to only partially change p and the non reversible change to you for the acceptance probability, then you can if you turn things properly, you you get a method that's 1.8 three times more efficient than if you tune HMRC optimally for this problem, at least.
Well, I didn't tune them precisely optimally. You can see which values I checked, and then I picked the best for each of these methods. So that's that's encouraging. I don't know what the scaling is here, and this is a relatively low dimensional problem and such if you have much higher dimensional continuous variables and perhaps more discrete variables for that matter, how the whole thing scales.
I don't know whether you are, whether you still get only roughly twice the efficiency, which would be slightly disappointing. I mean, getting twice being twice as efficient is nice, but you'd like if you were even more efficient or whether instead the advantage gets bigger and bigger as the dimensionality goes up. Or I suppose it's possible to get smaller as the study goes up. I don't know the answers to those questions yet.
One can also look at how this method works for Bayesian neural network models, which can be very complex hierarchical models. So here's a a description of some networks of increasing complexity, which are actually the ones I'm playing around with at the moment. At the simplest level, we could for a regression network, for instance, we might have one hidden layer of of 10 units if you're familiar with a neural network models.
Those are a common, although somewhat old fashioned activation function for 10h is is a common activation function for the hidden units. So we have one layer of those, and then that means we have three groups of weights. There's the weights from the inputs to those hidden units. There's what are called the biases or the intercept terms for the hidden units. And there are the weights from the hidden units to the output. And in the simplest model, that's that's those are the only groups you have.
So you have three hyper parameters controlling those three groups. You also have a have a four regression model, you have a parameter. That's the residual variance for the response, which also is effectively plays the role of a hyper parameter as far as its effect on how the model works to make a more complicated model.
But say that, well, we're not really sure about how relevant each input is, so we can instead have a have K plus three hyper parameters in which we have a separate hyper parameter for the for the weights out of each input into the hidden unit. So we have such hyper parameters and then we need since we got K of them, we're not we're not sure what prior to give them.
So in the same way as we did before, we say, well, since we're not sure we'll we'll create a higher level, higher parameter that controls the distribution of these K lower level hyper parameters. So that's another one. And then we we have the hidden unit biases that we have a hyper parameter to control and the hidden output weights going up in complexity.
We could say, well, maybe maybe the function here of these inputs has a additive structure where it's a a a function of one input of input one plus the function of input two plus the function of input three. Or maybe it's at least sort of approximately that. So we could have a model in which we have a K hidden layers, each of which looks at only one input for the K different inputs and and connects to the output, of course.
And those K hidden layers can compute the key components of an objective function. But maybe it's not completely additive, so we'll have another hidden layer that looks at all inputs and could model interactions, and that also goes to the output. And you add it all up. We have three K plus capas three hyper parameters there. So now, how well do the methods work on these models? Well, that's what I've been running simulations for for the last week or two, and it's actually pretty hard to say.
One reason is we don't know the correct answer. These are obviously these. These are. Are models that are far too complex to to to know with certainty what the correct answer is. We could look at the air for predictions on a test set. But unfortunately, there's no guarantee that the method that's best at sampling from the posterior will actually produce the best predictions. The data might not be well fitted to the model and in which case you're better off predicting.
You know, doing the wrong thing might actually be better if the model is not very good or even if the model is good, it's still based on limited data. You have a posterior spreader spread out someplace, and even if that includes the good values, it also will include some not so good values if you don't have a lot of data. And if you're sampling method doesn't really explore the full distribution, but happens to be in the good spot instead.
I could look better, even though it's actually worse as far as sampling the posterior. So it's hard to tell. I've been playing around with a so I previously played around with a simple classification model and looking at the correlate auto correlation time for the hyper parameters. The non-negotiable Lanzmann method was a little bit better than HMRC for the more complex regression models from the previous slide. I've played around with six inputs, all of which are relevant to varying degrees.
And with with the true function having both additive and intra interaction aspects and five training cases. And if you didn't fit the simplest of the models with only three hyper parameters, HMRC seems to do a slightly better job at prediction than the of the Langeveldt methods. But for the more complex models, the best predictions are made by by non reversible Langeveldt methods, although it's there's a lot of variation here.
You change tuning parameters and what doesn't seem to be drastically large amounts and you get much different predictions are much, much different predictive performance. So this this this it makes it complicated to see what what's actually going on. So I'm still playing around with this to see how much better you can get than HMRC by playing around with the non-responsive methods. So that's it. OK. Thank you very much for the talk, Robert. And now we'll open the floor to questions.
I should say perhaps that there is a bibliography here if you want to look at the slides at the end. Okay. Mm-Hmm. OK, thank you. So if you would like to ask a question, please unmute, resolve and ask. Hi, can you hear me? Yes, I can hear you. Hi, hello, Rothfeld. Very nice talk. Thank you. My question is whether you try to do similar things with Hamiltonian Monte-Carlo itself. Like, can you do also a non reversible, persistent? Yes.
I've expressed Horowitz's method as applying to lingerie, which is HMRC with a single leapfrog step, but the same thing can be done for HMRC with with more than one frog step. You can decide that you'll do it trajectories of 10 steps and then you'll also not completely replace the momentum before the next one. You'll do the same sort of thing. I've only partially changing the momentum.
And then you have the stimulus thing that if you actually rejected that trajectory, you'll end up reversing the momentum. And so you can do combined ones that that sort of tried to suppress random walks by a combination of doing multiple leapfrog steps per trajectory and only partially changing the momentum. Kennedy, I believe, investigated this at one point and found that disappointingly, you can only get a small benefit from the sort of hybrid methods that that you can see.
This is a tuning parameter, then is not only do you tune the step size and the number of leapfrog steps, but also how much you replace the momentum. And with that, that space of an extra tuning parameter, you can get slightly better than if you stuck to straight HMRC. But but not not not all that much in particular didn't change the the scaling with dimensionality. I think there's a paper by Kennedy and maybe some other people on that. Thank you.
Thank you, Peter. Anybody else? I mean, you hear me. Yep. All right. Thanks very much. That's very interesting. I thought that truck was making you a random variable. This was really neat. But if I understand correctly, the update was not deterministic, it was stochastic. Is that correct? Yeah. Well. Well, no, it's no. It's actually deterministic. If you. I mean, the actual update itself. Of course, things change sarcastically at a larger in a larger fashion.
But the actual the the actual thing in the state is that, as I have expressed, it is is v and you is the absolute value of V and then X is pi of X times the absolute value of the. And the update to V is itself is to simply add delta and then subtract two if if it's greater than one. But that's the actual update to be. But then V also has to change when exchanges of X changes to X star attacks prime. Then you have to change V to v times pi of X over Pi of X x prime.
So that the the proposal that resulted in that will have been stochastic. So. So we will change sarcastically as a result of of the changes to X, which is what you see here as long as we reject. So we are changing. X v precedes deterministic Lee. Well, you see these deterministic v up to one and then reflects and comes back down. That corresponds to V having switched to minus one.
And then and then the increases in V at that point are decreases in U. But as soon as there's an acceptance, there's a bit of jitter here because the ratio of Pi x prime to Pi X won't have been one. And so for the acceptances, there's some slight stochastic aspect resulting from the stochastic changes to it to X. And so there's no constraints that's there, that Delta not perfectly divide the interval and sort of thinking that I want to.
OK, so you could you expressed as you just at Delta, but you could out Delta plus some noise if you felt like it, in fact. And in fact, the noise distribution can be anything at all. And you might do that if you were worried that you are otherwise. So, for instance, if you're sampling from a uniform distribution. So the the ratios of Pi ex-Prime over Pi X are always one four four accepted moves then.
And you chose Delta to be exactly one half. Then you might worry that you're not sampling properly because they wouldn't actually matter. Because yeah, if you'd make it slightly more complex thing in which nevertheless, the ratios of Pi X primer grouping, there are always some, some small number of values, then it's conceivable. But you can end up being nonorganic by just adding Delta all the time. So if you're worried about that, you can add Delta and some amount of noise if you feel like it.
Thanks. OK. Any last questions for a for us? Need more questions for. Last call. Mm hmm. Okay. I should maybe mention that amongst the references is a paper on archive on this, which is this one here. Non reversibly updating uniform zero one value for Metropolis except reject decisions, which is on the archive. OK. OK, I guess then we'll wrap up the talk, and I think you get rougher for being able to join virtually in the air for great talk.
Thank you. Well, thanks for. Thanks for inviting me. OK, thank.
