Recent Applications of Stein's Method in Machine Learning

00:01

So, yeah, I already started recording, so hello to everyone and welcome to our seminar of the Computational Statistics and Machine Learning group at Oxford. Today we are very happy to have Qiang Liu. He is an assistant professor of computer science at the University of Texas at Austin. He did his PhD at the University of California, Irvine, and he has been a postdoctoral fellow at the Computer Science and Artificial Intelligence Lab at MIT.

00:40

His work lies at the intersection of machine learning and statistics, with interests spreading over the pipeline of data collection, learning, inference, decision making and various applications using probabilistic modelling. He is one of the leaders in the development of machine learning and statistical methodology that builds on Stein's method, a topic that is of great interest to many people in the room. So that's why we are very happy to have him here today.

01:10

Qiang says that he is also happy to take questions as you have them. I have muted all of you, but if you have questions, you can type them in the chat or just unmute yourself and ask them. But anyway, you can start now, Qiang. OK, thank you so much. It's great to be here and talk about Stein's method and machine learning. So this is really a few years of research on the topic.

01:45

So I will not talk about anything that is particularly recent, but focus on the basics of the framework.

01:54

So, in machine learning and statistics, one motivation that I had for this line of work was to develop computable discrepancy measures between data and models, because statistics and machine learning, at a high level, are really about doing a very simple thing, which is matching data and models: using the data to understand models, or using models to understand data.

02:26

Because of this, lots of problems in statistics and machine learning can be framed as either evaluating, computing or optimising some notion of discrepancy. Let's say we are doing parameter estimation. In that problem you are given a set of data, a set of points that you can view as an empirical measure, and you try to find a model that fits the data, so that can be viewed as minimising a discrepancy.

02:58

Typically the discrepancy is the KL divergence, which gives you maximum likelihood estimation. But from the other angle, sometimes we are given a probabilistic model and we want to understand the model. This happens especially in Bayesian inference. In that case, we have to draw samples from that distribution, and from the classical Monte Carlo or quadrature point of view, that can also be viewed as an optimisation problem.

03:29

So in this case, you are finding a set of samples, a set of points that fits your model, so that you can use the points to understand the model. Again, it is a discrepancy optimisation problem, from a different angle. And then you have model evaluation, which statisticians formulate as goodness-of-fit testing: we are given both a probabilistic model and a set of samples, and we want to decide whether the sample is actually drawn from the model.

04:03

So that can be viewed as evaluating whether the discrepancy equals zero or not. I think that's a summary of what we are doing in statistics. But what is additional in machine learning is that we care about very large models. We have highly structured, high-dimensional data, and we have to match it with really complicated models, such as the neural network models that are popular these days.

04:43

And the problem here is that the emphasis is a bit different. In statistics, we are mostly interested in finding the statistically most powerful estimators. In machine learning we often cannot achieve that, meaning that we can only hope to use whatever is computationally available to us, and we have to set priorities.

05:22

We prioritise computation over statistical efficiency. An example of the intractable models that are widely used in machine learning is unnormalised distribution models. What is happening here is that the probability distribution is specified by some unnormalised probability density function, and what is critical, and difficult, is to evaluate the integral that gives the normalisation constant.

06:01

This happens, obviously, in Bayesian statistics, Bayesian inference and graphical models, and lots of deep learning models also have this problem. Let's say your p-bar here is the exponential of a neural network; in that case, people use these energy-based models as one way to generate images and all kinds of things.

06:30

The traditional way to solve this problem is using MCMC, Markov chain Monte Carlo, which is known to be slow in many cases, but theoretically rigorous if it converges. On the other hand, in machine learning, lots of people use what is called variational inference.

06:56

This is the idea that you can transform the inference problem, just as I mentioned, into an optimisation problem over the KL divergence, so that you can approximate complicated distributions using simple parametric families such as Gaussians. But in this way, you have to specify what family you use, and if you are not doing that properly, you may end up with biases.

07:27

So today I will focus on Stein's method as a mechanism, as a new foundation for solving this kind of discrepancy problem, which in principle covers all three problems that I mentioned earlier. This applies whenever you want to evaluate the discrepancy between data and model, especially for these unnormalised distributions.

07:56

It turns out that Stein's method is indeed a fundamental approach to do that, one that allows us to avoid the computational difficulty that traditional methods, such as those based on maximum likelihood and KL divergence, have. OK, so Stein's method. It is a theoretical tool that was developed by Charles Stein as a technique to bound the difference between probability distributions.

08:36

It is a very elegant and smart technique that was found to be remarkably powerful in the theoretical probability community and has been used to do lots of things. Originally it was proposed as a way to prove central limit theorems. But then people realised you can extend it in many different ways and you can prove all kinds of probabilistic bounds, even concentration inequalities. And wherever it was applied, it was found to be really successful.

09:22

So there is a paper that is titled Stein's Magic Method, which I think is a very good description of the method, but it was not well known in the machine learning community, just because it was a purely theoretical tool, used to prove central limit theorems. If you are not interested in proving those, it was probably not that useful for you. But it turns out that is not true.

09:53

So it turns out that the key idea behind Stein's method is actually extremely powerful, even as a computational tool. And the fundamental reason is that all of the statistical and machine learning computations that we have are essentially about providing bounds between distributions, and that is exactly what Stein's method is doing. OK, so now I'm just going to dive in, and this is a very quick review of Stein's method.

10:26

In fact, it covers only the part of Stein's method that we will use, because we will only use the part of Stein's method that is essential to us. The other technical parts of classical Stein's method we will not talk about, because, at least right now, we are not able to use them for computational purposes.

10:48

So the part that we will use is the essential idea. Let's say p is the distribution, the intractable unnormalised distribution that is given to you. The whole idea of Stein's method is that you can construct something called a Stein operator, which is a differential operator that acts on a function space, such that if you apply the operator to an arbitrary function in a function set, satisfying some mild boundary condition and differentiability, then you will get a zero expectation.

11:29

You will get a zero-mean function. So the Stein operator is essentially doing some sort of centring operation. And it is constructed such that two distributions p and q are equal if and only if, when you apply the Stein operator associated with p, you always get zero expectation under q. This has to hold for an arbitrary function in the function set.

12:03

There are different ways to define the Stein operator, but the particular one that we will use is something like this. Here, basically, it is the inner product of the function phi that you are interested in and the derivative of log p, plus the divergence operator, which is the sum of the diagonal of the Jacobian. So here phi is actually a vector-valued function, so it maps from R^d to R^d.
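
In symbols, the operator described here can be written as follows. This is a sketch: the notation (T_p for the operator, phi for the test function, d for the dimension) is assumed for illustration and is not read off the slides.

```latex
\mathcal{T}_p \phi(x) \;=\; \nabla_x \log p(x)^\top \phi(x) \;+\; \nabla_x \cdot \phi(x),
\qquad
\nabla_x \cdot \phi(x) \;=\; \sum_{i=1}^{d} \frac{\partial \phi_i(x)}{\partial x_i},
\qquad
\mathbb{E}_{x \sim p}\big[\mathcal{T}_p \phi(x)\big] \;=\; 0 .
```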

12:41

So a way to think about this is the following. Let me see. Yes. OK, so the simplest, trivial way to achieve this zero mean is the following. You can take phi minus the expectation of phi. So a way to achieve this is simply to subtract the mean.

13:21

That is an operator that can be applied to phi and that allows us to centre everything. So you achieve this. But the problem is that you cannot directly calculate the expectation over, oh, sorry, over p. So here it should be p. You cannot directly calculate expectations under p.

13:45

What's magic about this method is that if you just replace this centring with this special operator, which just takes the inner product between phi and the derivative of log p plus this extra divergence term, then you achieve exactly the same thing as if you were centring using the mean under p. Now, if you can do that, you can actually convert it and go back to using that to calculate the centring.

14:18

And that involves solving some differential equation. But the essential idea is that you can centre everything under the distribution just by this operator, without directly calculating the integral. OK, so now I need to clean up my screen. Something's wrong. OK. Yes. OK, so now what makes this idea especially tractable is that if you look at the Stein operator, in fact everything is computable.

15:06

That holds even if the distribution is unnormalised. The reason is that the whole Stein operator depends on the distribution p only through the score function. And the score function is the derivative of log p, which is equal to the derivative of p divided by p. And if you do that, the dependency on the normalisation constant is cancelled. So you can directly calculate the score function without calculating the normalisation constant.

15:37

And that's the key. Practically, if you give me a distribution, I can just code up the Stein operator using Python or something. This is something that is completely computable. OK, so now why does the Stein operator have that strange equivalence? I'm going to give you some simple intuition. The best way to look at it is using integration by parts. So let's look at one direction, which is: if p equals q, then that whole thing is equal to zero,
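
As a minimal sketch of what "coding up the Stein operator" could look like, the snippet below uses a one-dimensional unnormalised density p_bar(x) = exp(-x^4/4) and the test function phi(x) = sin(x). Both choices, and all function names, are assumptions made for illustration. The operator itself only needs the score; the grid-based `expectation` helper is there purely to check the zero-mean property numerically.

```python
import numpy as np

def score_p(x):
    # d/dx log p_bar(x) for p_bar(x) = exp(-x**4 / 4); the normalising constant never appears
    return -x**3

def stein_op(phi, dphi, x):
    # One-dimensional Stein operator: score(x) * phi(x) + phi'(x)
    return score_p(x) * phi(x) + dphi(x)

def expectation(unnorm_density, f, lo=-10.0, hi=10.0, n=200001):
    # Self-normalised Riemann sum on a uniform grid (used only for checking)
    x = np.linspace(lo, hi, n)
    w = unnorm_density(x)
    return np.sum(w * f(x)) / np.sum(w)

p_bar = lambda x: np.exp(-x**4 / 4)            # the model p (unnormalised)
q_bar = lambda x: np.exp(-(x - 1.0)**2 / 2)    # a different distribution q
t_phi = lambda x: stein_op(np.sin, np.cos, x)  # T_p applied to phi = sin

under_p = expectation(p_bar, t_phi)  # numerically zero
under_q = expectation(q_bar, t_phi)  # clearly non-zero
print(under_p, under_q)
```

The expectation under p vanishes up to numerical error, while the expectation of the same function under the mismatched q does not, which is the equivalence the talk goes on to explain.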

16:16

and that is actually equivalent to something called Stein's identity. This is more well known in statistics in general; it is more widely used than Stein's method. Essentially, it says that this whole thing is equal to zero under p, and you can prove it just by expanding the expectation. You have p multiplied by the Stein operator, and then you cancel the log p, and you get this whole thing.

16:46

And this is actually an integration by parts, so the integral equals the value of p times phi on the boundary, assuming it is one-dimensional. And then if you assume that the product p times phi has zero value on the boundary, or decays sufficiently fast, then you get Stein's identity. So what this says is that we do need some boundary condition, but this is a very mild condition, because it only requires p times phi to decay.
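
Written out in one dimension, the integration-by-parts argument can be sketched as follows, assuming p and phi are differentiable and p(x) phi(x) vanishes as x goes to plus or minus infinity:

```latex
\mathbb{E}_{x\sim p}\big[\phi(x)\,\partial_x \log p(x) + \phi'(x)\big]
= \int \big(p'(x)\,\phi(x) + p(x)\,\phi'(x)\big)\,dx
= \int \frac{d}{dx}\big(p(x)\,\phi(x)\big)\,dx
= \Big[\,p(x)\,\phi(x)\,\Big]_{-\infty}^{\infty}
= 0 .
```

The first equality uses p(x) times the derivative of log p(x) equals p'(x), which is where the log p cancels.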

17:21

So if your p decays at the boundary, then you don't need to worry about phi. If it doesn't decay, then you have to choose phi to decay. Either way, it is easy to achieve in practice. So Stein's identity in particular has been widely used. It's a really powerful tool, and here is the reason it's powerful.

17:46

I think, again, if you think about it, it's like a magic idea: suddenly, for any given distribution p, you can get an infinite number of identities that you can actually calculate, even though the distribution is intractable. This is remarkable. And then you can use this to do lots of things. For example, if you treat them as moment equations, you can use them as a way to estimate parameters.

18:17

There are many different methods developed and related to this, including the score matching method for energy-based models. There are many other things that you can do. For example, you can use this equation as a control variate, and that allows you to reduce the variance. In fact, again, magic is happening here.

18:39

So it turns out that under certain conditions, you can actually reduce the variance to zero, meaning that, whereas the typical convergence rate is one over root n, you can now actually get a faster rate than the typical scaling. So again, it's a very remarkable tool. But I think most people actually know more about Stein's identity than about Stein's method.
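
A toy illustration of the control variate idea, under assumptions chosen so that the variance drops exactly to zero: the target is p = N(0, 1), the integrand is h(x) = x^2, and the test function is phi(x) = x. None of these choices come from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)      # samples from p = N(0, 1)

h = x**2                           # integrand; E_p[h] = 1

# Stein operator applied to phi(x) = x, using score(x) = -x for N(0, 1):
# (T_p phi)(x) = score(x) * phi(x) + phi'(x) = -x**2 + 1, which has zero mean under p.
control = -x**2 + 1.0

plain = h                          # ordinary Monte Carlo terms
corrected = h + control            # same expectation, but constant in x

plain_var = plain.var()
corrected_var = corrected.var()
print(plain.mean(), plain_var)
print(corrected.mean(), corrected_var)
```

Here the corrected terms are identically one, so a single sample already gives the exact answer. In general phi has to be chosen or fitted, and the variance is reduced rather than eliminated.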

19:04

So what Stein's method does is something that I think is deeper than Stein's identity, but less well known. That is about the other direction of the implication, which says that if p doesn't equal q, then I must be able to find some phi such that I can violate that equation.

19:25

So what it says here is that for any two distributions p and q that are different, I can always find some sort of discriminator phi that gives a non-zero expectation of the Stein operator. A simple way to see this is by this simple derivation. Basically, if you look at the expectation of the Stein operator of p under q, then you can subtract another term, which is the Stein operator of q under q. This second term is zero by Stein's identity.

20:03

And then you can combine these two Stein operators, and the divergence terms cancel. Then you get the difference of the score functions in an inner product with phi. Essentially, what this says is that this whole expectation is actually calculating some sort of inner product between phi and the difference of the score functions. Now, if the score functions of p and q are not equal,

20:30

then in principle you should find a phi that violates the zero condition, just by taking phi to be the difference of the score functions. So in this way, you can show that the identity is violated.
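
The derivation just described can be sketched in symbols as follows. The notation s_p and s_q for the score functions of p and q is assumed here, and phi is assumed regular enough for Stein's identity to hold under q.

```latex
\mathbb{E}_{x\sim q}\big[\mathcal{T}_p\phi(x)\big]
= \mathbb{E}_{x\sim q}\big[\mathcal{T}_p\phi(x)\big] - \underbrace{\mathbb{E}_{x\sim q}\big[\mathcal{T}_q\phi(x)\big]}_{=\,0}
= \mathbb{E}_{x\sim q}\big[(s_p(x) - s_q(x))^\top \phi(x)\big],
\qquad s_p = \nabla_x \log p,\; s_q = \nabla_x \log q .
```

Choosing phi = s_p - s_q gives the expected squared norm of the score difference under q, which is strictly positive whenever the two scores differ on a set of positive q-probability.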

20:49

So again, it's a very simple intuition here. Now there's another way to prove it, which is less well known, but actually this is a way that I really like and that motivated lots of my methods, which is that it turns out this whole thing relates to KL divergence in a very interesting way. So assume you have a random variable x that is drawn from q. And remember that phi is actually a vector field.

21:28

So what you can do is take phi as a vector field, multiply it by some small step size epsilon, and you get an updated variable x' = x + epsilon phi(x). Now, if x is drawn from q, then the distribution of x' is this q with subscript epsilon phi, which depends on both epsilon and phi.

21:48

And then what you can do is take the KL divergence between q with subscript epsilon phi and p, and take the derivative with respect to epsilon, and it turns out that the derivative at epsilon equal to zero is exactly the negative of the Stein expectation.
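
In symbols, the statement reads as follows. The bracket notation for the transformed distribution is an assumption made for readability.

```latex
x' = x + \epsilon\,\phi(x),\quad x \sim q \;\;\Longrightarrow\;\; x' \sim q_{[\epsilon\phi]},
\qquad
\frac{d}{d\epsilon}\,\mathrm{KL}\big(q_{[\epsilon\phi]} \,\big\|\, p\big)\Big|_{\epsilon=0}
= -\,\mathbb{E}_{x\sim q}\big[\mathcal{T}_p\phi(x)\big].
```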

22:05

So what's happening here is that as you apply this transform to the random variable and increase the step size from zero to some small value, you can measure the rate of increase of the KL divergence, and that rate is exactly minus this expectation of the Stein operator. Now, in this view, you can see this whole thing as some sort of gradient of the KL divergence.

22:39

So if p equals q, then obviously you have zero KL divergence and you can no longer decrease it. That's why you get zero. But if you have two distributions that are different, then you should be able to find a direction that decreases the KL divergence, and that direction is going to be exactly the phi that has a non-zero decreasing rate of KL divergence. Any questions so far? I don't see any questions.

23:08

So. OK, sounds great. OK. And then you can essentially summarise Stein's method using the Stein discrepancy. The idea is that now, if you are given two distributions p and q, we can just take the maximum of this expectation of the Stein operator over some function family, some function set.

23:35

And now, if the function set is sufficiently large, then this whole thing should differentiate between the two: it is equal to zero if and only if p equals q. The choice of this function class is actually very, very important. In the original, classical Stein's method, which was developed for theoretical purposes, you really want that function space to be large, because you don't care about actually computing these things numerically and you just want to use it as a tool.

24:09

You want to make sure it is sufficiently large to bound other metrics, such as the Wasserstein distance or the total variation distance. So basically the way it works is that you can use the Stein discrepancy to upper-bound, let's say, the Wasserstein distance, and then if you can show that the Stein discrepancy is small, the Wasserstein distance is also small. That's how you can prove all kinds of bounds. But for practical purposes, that does not work.

24:38

You cannot choose such a large function set F, because we actually want to numerically calculate this Stein discrepancy. So we have to choose a function space that is both sufficiently large and computationally tractable, so that we do sacrifice some statistical power, but we gain computational efficiency. That's the essential trade-off here. And the function class that we are using is the reproducing kernel Hilbert space, the RKHS.

25:14

So here's a very brief introduction. Let's say we have some positive definite kernel. Then the reproducing kernel Hilbert space is defined essentially as the linear span of the kernel, where you can take arbitrary reference points in the space, and you can have an infinite number of these reference points combined together, and then you can define the norm and inner product in this way, and if you take the closure, you get the RKHS.

25:49

And if you choose the kernel to be strictly positive definite in a certain sense, then the space can approximate the space of continuous functions arbitrarily well on a bounded domain. And then you just plug it in: let's say I'm optimising this whole thing, but now I'm working in the reproducing kernel Hilbert space. Here we add a constraint that the norm has to be no larger than one, to avoid the scaling issue.

26:26

And then you can actually solve this optimisation in closed form, and the optimal solution phi-star is exactly given by the Stein operator: you apply the Stein operator to the kernel, which is a function of two variables, you integrate out one variable, and that gives you another function. And then you can also show that the Stein discrepancy, the maximum value, is going to be the expectation of a new kernel function, and the new kernel function is very interesting.

27:00

So basically you have the original kernel, and it's a two-variable function. Then you just apply the Stein operator twice, the first time treating it as a function of x', the second time as a function of x, and that gives you a new positive definite kernel that is in some sense centred. And then you can show this is very similar to the kernel maximum mean discrepancy, but now we have a special kernel that is defined by the Stein operator.
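
For reference, the closed form described here is usually written as below. The symbols u_p for the new kernel, s_p for the score of p and D for the discrepancy are assumed notation; the expectation of the new kernel gives the square of the maximum value over the unit ball.

```latex
\phi^*(\cdot) \;\propto\; \mathbb{E}_{x\sim q}\big[s_p(x)\,k(x,\cdot) + \nabla_x k(x,\cdot)\big],
\qquad
\mathbb{D}^2(q,p) \;=\; \mathbb{E}_{x,x'\sim q}\big[u_p(x,x')\big],

u_p(x,x') \;=\; s_p(x)^\top k(x,x')\,s_p(x')
\;+\; s_p(x)^\top \nabla_{x'}k(x,x')
\;+\; \nabla_x k(x,x')^\top s_p(x')
\;+\; \operatorname{tr}\big(\nabla_x\nabla_{x'}k(x,x')\big).
```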

27:35

And you can do the derivation yourself; it's actually a simple derivation. Basically, the reason we can solve this whole thing in closed form is that the Stein operator is a linear operator, and if you optimise a linear functional over the unit ball of a Hilbert space, it always gives you a closed form. So this is a simple derivation. And then, because of this nice closed form, you can actually evaluate it.

28:11

Now this is really getting to our point, which is: if the distribution q is unknown and is observed through a set of i.i.d. samples x_i, then you can approximate this Stein discrepancy between q and p using the empirical version. There are different ways to do it. Here I'm writing the unbiased one. Basically, the true discrepancy is the expectation of that kernel.

28:41

But now you can replace the expectation with an empirical sum. If you remove the diagonal, you get an unbiased estimate. This is what's called a U-statistic. And then you can show nice asymptotic properties of that, and you can use this to construct a very powerful goodness-of-fit test, saying that if the discrepancy between the empirical data and p is larger than some threshold, you can reject the hypothesis that p equals q.
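
A minimal sketch of the U-statistic estimate, assuming an RBF kernel with a fixed bandwidth h and a standard normal model p whose score is -x. The function name `ksd_u_statistic`, the bandwidth and the sample sizes are illustrative assumptions, and no test threshold is computed here.

```python
import numpy as np

def ksd_u_statistic(x, score, h=1.0):
    """U-statistic estimate of the squared kernelised Stein discrepancy (RBF kernel)."""
    n, d = x.shape
    s = score(x)                                   # (n, d) score of the model p at each sample
    diff = x[:, None, :] - x[None, :, :]           # diff[i, j] = x_i - x_j
    sqdist = np.sum(diff**2, axis=-1)
    k = np.exp(-sqdist / (2 * h**2))               # k(x_i, x_j)
    term1 = (s @ s.T) * k                          # s(x_i)^T k s(x_j)
    term2 = np.einsum('id,ijd->ij', s, diff) / h**2 * k    # s(x_i)^T grad_{x_j} k
    term3 = -np.einsum('jd,ijd->ij', s, diff) / h**2 * k   # grad_{x_i} k^T s(x_j)
    term4 = (d / h**2 - sqdist / h**4) * k         # trace of the mixed second derivative
    u = term1 + term2 + term3 + term4
    np.fill_diagonal(u, 0.0)                       # drop the diagonal for unbiasedness
    return u.sum() / (n * (n - 1))

rng = np.random.default_rng(0)
score_std_normal = lambda x: -x                    # model p = N(0, 1); no normaliser needed

ksd_match = ksd_u_statistic(rng.normal(0.0, 1.0, size=(500, 1)), score_std_normal)
ksd_shift = ksd_u_statistic(rng.normal(1.0, 1.0, size=(500, 1)), score_std_normal)
print(ksd_match, ksd_shift)
```

Samples that really come from the model give a value near zero (it can be slightly negative, since the estimate is unbiased), while the shifted samples give a clearly positive value. A test would compare the statistic against a threshold.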

29:16

So this is one way to do goodness-of-fit tests. And what's interesting about this method is that now you can actually do these tests for unnormalised distributions: very complicated, let's say, graphical models and high-dimensional structured models. This was not possible using traditional methods. And the threshold here can be decided either by bootstrap, or you can derive a concentration inequality for the Stein discrepancy and use that as the threshold as well.

29:54

But then another idea is that, you know, we talked about this different view. Goodness-of-fit testing is like evaluating the discrepancy. But let's say we are doing a sampling problem, where you want an approximation. That's like: I give you a model, and you want to find a set of points, which you can view as finding a set of points to fool the goodness-of-fit test.

30:23

If we can fool the goodness-of-fit test, then that means the sample is a good approximation of the distribution. So now you can frame it as a minimisation problem: given a distribution p, I want to find a set of points to minimise the Stein discrepancy. By doing that, you hopefully find points that approximate the distribution well. This is indeed a very powerful idea and has been exploited in several different ways.

30:51

So the way that I will explore is a bit different. I'm not going to directly optimise the points, because somehow that is difficult; it becomes a complicated optimisation problem. Instead of doing this, I can solve an easier problem, which is the following.

31:14

Assume we already have a set of points x_i that is generated arbitrarily, and what we want is to find a set of weights associated with those points such that the weighted empirical measure of the points approximates the distribution. That can be framed as minimising this weighted quadratic function subject to normalisation constraints.
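
A rough sketch of the reweighting idea under simplifying assumptions: only the sum-to-one constraint is imposed (no non-negativity constraint), which gives a closed form via one linear solve, and a small ridge term is added for numerical stability. The RBF kernel, the function names and the ridge value are illustrative choices, not the speaker's implementation.

```python
import numpy as np

def stein_kernel_matrix(x, score, h=1.0):
    # Matrix of the Stein kernel u_p(x_i, x_j) built from an RBF base kernel
    n, d = x.shape
    s = score(x)
    diff = x[:, None, :] - x[None, :, :]            # diff[i, j] = x_i - x_j
    sqdist = np.sum(diff**2, axis=-1)
    k = np.exp(-sqdist / (2 * h**2))
    # (s_i - s_j)^T (x_i - x_j): the two cross terms combined
    cross = np.einsum('id,ijd->ij', s, diff) - np.einsum('jd,ijd->ij', s, diff)
    return ((s @ s.T) + cross / h**2 + d / h**2 - sqdist / h**4) * k

def stein_weights(K, ridge=1e-6):
    # Minimise w^T (K + ridge I) w subject to sum(w) = 1 (weights may be negative)
    n = K.shape[0]
    a = np.linalg.solve(K + ridge * np.eye(n), np.ones(n))
    return a / a.sum()

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=1.0, size=(100, 1))   # points from the "wrong" distribution
K = stein_kernel_matrix(x, score=lambda z: -z)       # target p = N(0, 1)
w = stein_weights(K)
u = np.full(100, 1 / 100)                            # uniform weights for comparison

quad_uniform = u @ K @ u
quad_weighted = w @ K @ w
print(quad_uniform, quad_weighted)
print(float(u @ x[:, 0]), float(w @ x[:, 0]))        # unweighted vs reweighted mean estimate
```

The weights never use the distribution the points were drawn from, only the score of the target. Adding a non-negativity constraint would turn this into a small quadratic program.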

31:50

That is actually quite powerful. Here we are keeping the set of points as given to you, and these are arbitrary points; you don't need to know where they come from. For example, you could run an MCMC procedure and get an approximation, but you are not sure whether the approximation is good enough. Then what you can do is find a set of weights to correct the bias of your original MCMC procedure.

32:23

And you don't have to know the distribution of the x_i, and they can even be generated deterministically. Using this method, you can still get a set of weights that corrects the bias in the distribution. We can show that this is actually very nice: not only does it not require us to know the proposal distribution of the x_i, it actually gives you a better estimate when the function h that you want to integrate is smooth.

33:02

You can also get some benefit of variance reduction, so that you can actually improve the approximation rate. So this is one kind of approach in which we can exploit the Stein discrepancy to improve numerical approximation. But then there's another method that I think is particularly interesting, which is: how can we directly find a set of points to approximate a distribution? This is really the sampling problem. For some reason, I always have difficulty at...

33:46

Yeah. OK, so the idea here is that we are given a distribution p and we want to find a set of points to approximate it; essentially this is the sampling problem. Instead of minimising the Stein discrepancy, what we can do is directly minimise the KL divergence by optimally changing the variable, transporting the variable in some sense. So it's very similar to optimal transport.

34:19

So the idea is that every time we have these particles, we transport the particles in this way, and we choose the velocity field phi such that it always decreases the KL divergence as fast as possible. That can be framed as maximising the negative rate of change of the KL divergence. Essentially, this is really defining some notion of functional gradient descent on the space of distributions.

34:52

And as I mentioned earlier, it turns out this decreasing rate of KL divergence is exactly the expectation of the Stein operator. That's why this optimisation reduces exactly to the optimisation we had for the Stein discrepancy, and therefore the optimal phi that we obtained earlier is exactly the phi that allows you to decrease the KL divergence as fast as possible, so you can use this phi to transport your particles.

35:25

And then it turns out the Stein discrepancy is exactly the maximum decreasing rate of the KL divergence. So it quantifies how much you can decrease the KL divergence starting from q towards p. And using that you can derive what's called Stein variational gradient descent, SVGD. Basically, you maintain a set of particles; every time, q is equal to the empirical measure of the particles, and then you just apply this transform iteratively.

36:00

It's very similar to gradient descent, but it is a particle system, because you have a set of particles, each of them is x_i, and they are updated sequentially. And this is an interacting particle system, because the update of each particle depends on the others through the empirical measure. This is something called the mean field of the particles.

36:33

So that's why it's also related to mean-field interacting particle systems, which are studied a lot in applied mathematics. So this is the intuition for what's happening. It turns out the first term here is a gradient term that drives the particles to increase the probability. The second term here is a repulsive force term that, practically speaking, forces the different particles to stay away from each other.

37:04

And then in the end, you get a nice approximation of the distribution. If you don't have the second term, the particles collapse and you can only find a mode, like typical optimisation does. So the repulsive force plays a critical role here. This is what's happening when you have lots of particles: then you can recover the density function.
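
The update with its two terms can be sketched as follows, assuming an RBF kernel with a fixed bandwidth h, a fixed step size, and a one-dimensional Gaussian target N(2, 1). All of these, including the function name `svgd_step`, are illustrative assumptions rather than the settings used in the talk.

```python
import numpy as np

def svgd_step(x, score, step=0.1, h=1.0):
    """One SVGD update for particles x of shape (n, d) with an RBF kernel."""
    n = x.shape[0]
    diff = x[:, None, :] - x[None, :, :]                 # diff[i, j] = x_i - x_j
    k = np.exp(-np.sum(diff**2, axis=-1) / (2 * h**2))   # k(x_i, x_j), symmetric
    drive = k @ score(x)                                 # sum_j k(x_j, x_i) * score(x_j)
    repulse = np.sum(k[:, :, None] * diff, axis=1) / h**2  # sum_j grad_{x_j} k(x_j, x_i)
    return x + step * (drive + repulse) / n

score = lambda x: -(x - 2.0)                             # target p = N(2, 1)

rng = np.random.default_rng(0)
particles = rng.normal(-5.0, 0.5, size=(50, 1))          # start far from the target
for _ in range(1000):
    particles = svgd_step(particles, score)
print(particles.mean(), particles.std())

# With a single particle the kernel is 1 and the repulsive term vanishes,
# so the update is plain gradient ascent on log p.
single = np.array([[0.0]])
single_next = svgd_step(single, score, step=0.1)
```

The `drive` term pulls particles towards high probability and the `repulse` term pushes them apart; removing `repulse` makes all particles drift to the mode.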

37:33

So. Yeah, so you can almost view this as a kind of limit: when you have infinitely many particles, then essentially this whole process is like evolving some partial differential equation. And that's exactly what we can analyse here. So just another demo.

37:56

Yeah. So one particular practical advantage is that this algorithm reduces exactly to gradient descent when you only have one particle, and this is very nice, because if you use typical Monte Carlo methods and approximate the whole distribution with one single point, that point is going to be very random and it's not going to do well in any sense, except that it's probably an unbiased estimate.

38:32

But with SVGD, using only one single particle, you already get the mode, and the mode is already very powerful, as we see in machine learning. That's why you can build up around the MAP estimate and then gradually increase the number of particles. It turns out that there is rich theory associated with this type of algorithm.

38:57

In the limit when you have, let's say, an infinite number of particles, and if the step size decreases to zero, then this whole particle system can be associated with a differential equation, and you can show that this differential equation decreases the KL divergence monotonically, unsurprisingly with a rate given by the Stein discrepancy.

39:25

And then you can also show that, formally, you can interpret that whole process as a gradient flow of the KL divergence in the space of distributions. This is really getting to a very close connection to optimal transport. It turns out you can define some sort of optimal transport distance from q to p as the minimum transport cost needed to transport the mass from q to p.

40:01

But here we are using a very special way to define the transport cost, by using the RKHS norm. If you use the typical L2 norm, you get the typical optimal transport. But here it's a kind of kernelised optimal transport in some sense. And if you define that metric and define the gradient flow under that metric, you get SVGD.

40:35

So this is a comparison between SVGD and Langevin dynamics, which are very similar and closely related. If you run Langevin dynamics, you have particles, and every time you are adding random noise. But here in SVGD we have a set of particles that interact through a deterministic function. Both of them can be characterised by differential equations.
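
The contrast can be sketched in a few lines of NumPy. This is a minimal illustration, assuming 1-D particles, a Gaussian target and an RBF kernel with a fixed bandwidth `h`; the function names and default values are hypothetical and not taken from the talk.

```python
import numpy as np

def score(x, mu=0.0, sigma=1.0):
    """Gradient of log p for an assumed Gaussian target N(mu, sigma^2)."""
    return -(x - mu) / sigma**2

def langevin_step(x, step, rng):
    """Unadjusted Langevin: every particle moves on its own, with fresh noise."""
    noise = rng.standard_normal(x.shape)
    return x + step * score(x) + np.sqrt(2.0 * step) * noise

def svgd_step(x, step, h=1.0):
    """SVGD: deterministic update in which particles interact through the kernel."""
    diff = x[:, None] - x[None, :]          # diff[j, i] = x_j - x_i
    k = np.exp(-diff**2 / (2.0 * h**2))     # k[j, i] = k(x_j, x_i)
    grad_k = -diff / h**2 * k               # derivative of k(x_j, x_i) w.r.t. x_j
    phi = (k * score(x)[:, None] + grad_k).mean(axis=0)
    return x + step * phi
```

With a single particle the kernel term is constant and the repulsion vanishes, so `svgd_step` reduces to a plain gradient ascent step on log p.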

41:08

And actually, both of them have a gradient flow interpretation, except that Langevin dynamics is the gradient flow under the typical L2 optimal transport, whereas SVGD has a special kernelised optimal transport. There is also another way to view SVGD, very different from the gradient flow view. Essentially, as you evolve these particles, it's actually trying to do something that is very similar to quadrature methods in numerical integration.

41:43

So, if you remember from the numerical methods textbook, let's say Gauss-Hermite quadrature: these methods are basically based on the idea that you want to find a set of points such that when you integrate polynomials, for example, you get exactly the right solution. And then the hope is that the actual function that you integrate is close to a polynomial, so that you will get a good approximation. It turns out SVGD is doing something very similar to that.
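
As a reminder of what that exactness looks like, here is a small sketch using NumPy's Gauss-Hermite nodes for the standard normal. The helper name `gauss_hermite_expectation` is hypothetical; `hermegauss` is the NumPy routine for the weight function exp(-x^2 / 2).

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def gauss_hermite_expectation(f, n_points):
    """Approximate E[f(X)] for X ~ N(0, 1) with an n-point Gauss-Hermite rule."""
    nodes, weights = hermegauss(n_points)   # weights integrate against exp(-x^2 / 2)
    return np.sum(weights * f(nodes)) / np.sqrt(2.0 * np.pi)

# Exact for polynomials of degree <= 2 * n_points - 1, so degree 4 is exact with 3 points.
fourth_moment = gauss_hermite_expectation(lambda x: x**4, 3)
# Only approximate for non-polynomial integrands such as cos.
cos_mean = gauss_hermite_expectation(np.cos, 3)
```

Three points reproduce the fourth moment of the standard normal exactly, while the cosine integrand is only approximated; how well depends on how close it is to a low-degree polynomial.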

42:18

It turns out you can actually find a set of functions that any fixed point of SVGD matches exactly, and that set of functions is decided by the Stein operator as well as the kernel. So if you choose the kernel properly, you will recover the polynomial family. But now we are more general, so we can get more general sets of functions. That's essentially what it's doing here.

42:52

And basically, you can show that if you are approximating a Gaussian distribution and you use a linear kernel, then you will actually recover the mean and the variance. And if you use a polynomial kernel with a Gaussian distribution, you will actually recover the polynomial families. You can apply this method to more general distributions, and using this, you can actually show some bounds. I think this opens some very interesting directions and angles that haven't really been explored.

43:25

OK, so I think I'm out of time, but very quickly: there are several extensions that I think I can cover briefly. One extension that I think is particularly interesting is that this whole thing doesn't have to depend on the gradient. It turns out you can actually derive a gradient-free version of SVGD. It's an idea that is very similar to importance sampling, but different in important ways.

43:55

So basically, what's happening here is: assume you need the gradient of log p, and assume it's very difficult to calculate. Then what you can do is pick an arbitrary positive function ρ and replace the gradient of log p with the gradient of log ρ. Obviously, this will give you the wrong direction, but then you can correct the bias using the importance ratio, the ratio between ρ and p.
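
One way to write such a direction is sketched below, assuming 1-D particles and an RBF kernel. The function name, the argument names and the choice to self-normalise the weights are assumptions for illustration, not details given in the talk.

```python
import numpy as np

def gradient_free_svgd_direction(x, log_p_unnorm, log_rho, grad_log_rho, h=1.0):
    """Update direction that uses grad log rho in place of grad log p (1-D particles)."""
    log_w = log_rho(x) - log_p_unnorm(x)    # log of the ratio rho / p, up to a constant
    w = np.exp(log_w - log_w.max())         # subtract the max for numerical stability
    w = w / w.sum()                         # self-normalising cancels the unknown constant
    diff = x[:, None] - x[None, :]          # diff[j, i] = x_j - x_i
    k = np.exp(-diff**2 / (2.0 * h**2))     # k[j, i] = k(x_j, x_i)
    grad_k = -diff / h**2 * k               # derivative of k(x_j, x_i) w.r.t. x_j
    # Weighted sum over source particles j; no gradient of log p is ever evaluated.
    return (w[:, None] * (k * grad_log_rho(x)[:, None] + grad_k)).sum(axis=0)
```

When ρ equals p the weights are uniform and this reduces to the usual SVGD direction. Without the self-normalisation, an unknown normalising constant of p would only rescale the direction, which is the same as rescaling the step size.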

44:24

And in that way, you actually still converge to the correct distribution. This can be very useful if the gradient of your distribution is very difficult to calculate, right? Another algorithm that I think is interesting, but less understood, is amortised SVGD. So the idea here is that, instead of finding a set of particles to approximate the distribution,

44:55

what I can do is something very similar to a GAN, which is to find a neural network such that when you inject random inputs into the neural network, the network outputs random outputs that approximately follow the distribution that you want. This can be done easily using some sort of imitation idea. So the idea here is that every time, you update your network such that the particles follow the SVGD direction.

45:26

So here is an iterative algorithm. Let me explain very quickly. Every time, you have a neural network and the neural network outputs a set of particles; these will be the green dots. Then you update the particles using SVGD, so the particles move closer to the target distribution; these will be the purple dots.

45:56

And then what you can do is go back to the neural network and modify the weights such that next time the neural network outputs the purple dots, right? Then, based on that, you further find points that are closer to the distribution, and you update the neural network such that the dots are even closer. So by iterating this, you can actually train your network to draw samples from distributions.
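
The loop just described can be sketched with a toy stand-in for the neural network. Everything here is an assumption for illustration: the "network" is the affine map g(z) = a * z + b, the target is a 1-D Gaussian, the kernel is RBF with fixed bandwidth, and the function names and hyperparameters are hypothetical.

```python
import numpy as np

def svgd_direction(x, score, h=1.0):
    """SVGD direction for 1-D particles with an RBF kernel."""
    diff = x[:, None] - x[None, :]          # diff[j, i] = x_j - x_i
    k = np.exp(-diff**2 / (2.0 * h**2))
    grad_k = -diff / h**2 * k
    return (k * score(x)[:, None] + grad_k).mean(axis=0)

def train_amortised_sampler(score, n_iters=1000, batch=50, lr=0.1, seed=0):
    """Train the toy generator g(z) = a * z + b so its outputs follow the SVGD direction."""
    rng = np.random.default_rng(seed)
    a, b = 1.0, 0.0
    for _ in range(n_iters):
        z = rng.standard_normal(batch)      # random inputs
        x = a * z + b                       # green dots: current generator outputs
        phi = svgd_direction(x, score)      # direction from green dots toward purple dots
        # Move the parameters so the outputs move along phi (chain rule through g).
        a += lr * np.mean(phi * z)          # dg/da = z
        b += lr * np.mean(phi)              # dg/db = 1
    return a, b

# Assumed target N(3, 1), whose score is -(x - 3).
a, b = train_amortised_sampler(lambda x: -(x - 3.0))
```

After training, fresh samples come from `a * z + b` with new noise `z`, without running SVGD again. A real neural network would replace the two scalar parameters with its weights and the two update lines with backpropagation.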

46:27

So that's essentially what I wanted to talk about. This area of Stein's method in machine learning has really, I think, attracted lots of recent interest. I think it's an area with lots of very interesting theoretical problems that are still very open. For example, for SVGD, we don't know exactly what the rate of convergence is, and we don't know what the best choice of kernel is, which is always a problem for kernel methods.

46:59

And there is a lot of space for improving and extending SVGD, as well as KSD, that I don't think has been fully explored, and lots of applications as well. In fact, this idea has been used in many applications, such as reinforcement learning, deep learning and uncertainty quantification. So I think there's also lots of room for applications. So with that, I'll stop here. Thank you. Yeah. I don't know if anyone has questions.

47:45

I do have some questions. So first: at the end, you mentioned this importance-sampling-like method, and you emphasised that it is not quite importance sampling. So I wanted to know why it is not quite the same. It is not the same because here we are not doing Monte Carlo sampling, right? And if you look at this, it's a weird method, because here the numerator of the importance weight is the proposal.

48:27

The target is actually in the denominator, but in typical importance sampling, the actual distribution is in the numerator. Yeah, yeah. Well, I mean, it's similar in that both of them involve the density ratio, but they are different because it's a completely different setting. We're not doing any Monte Carlo in this case. Can I actually follow up on that? Yeah.

48:57

So I wasn't quite sure about this importance sampling, because you said the big advantage is that we don't need the normalising constant for p, right? If you take ρ to be some other distribution, you can't actually calculate the ratio unless we have the normalising constant, right? Yes, but it's really just a part of the step size.

49:25

Because, let's say p has a normalisation constant, but you can push the normalising constant into the step size. And then if you choose the step size to be small, you don't need to worry about that. Does it make sense? Let's see, let's say I have Z. So then you don't know the epsilon, but... Yeah, yeah. You have to divide by Z here, but then you can push the Z here.

49:59

So then the epsilon is only epsilon times Z, right? But then it's the step size, which you can choose. Yes, and then the step size goes to zero. So, yeah, it will impact the way you choose the step size, but other than that... Yeah. OK, thank you. Another question: at some point you showed a convergence result, and you said that, in order to approximate an integral, this Stein-based approach achieves a convergence rate which is strictly better than Monte Carlo.

50:42

So that was surprising to me. Could you comment on that? Is it this one? Yeah. Yes, it's actually something that is very interesting, so let me explain what's happening here. Basically, the reason is that here you are designing the weights to explicitly minimise the Stein discrepancy, right?

51:13

It turns out this Stein discrepancy can be written as the sup of the difference between the empirical mean and the actual mean over a kind of special space. That space is obtained by taking the original RKHS and applying the Stein operator over that space; you will get a new space. It turns out that the functions in that space are approximated particularly well by Stein discrepancy, because they are exactly the kind of things that will be bounded by the Stein discrepancy.
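
Written out, this reading of the Stein discrepancy can be sketched as below. The notation is an assumption: \mathcal{H} is the RKHS of the kernel, \mathcal{B} is the unit ball of \mathcal{H}^d, \mathcal{A}_p is the Stein operator, and q is the weighted empirical distribution of the points. The second equality uses Stein's identity, \mathbb{E}_p[\mathcal{A}_p f] = 0.

```latex
\mathbb{D}(q, p)
= \sup_{f \in \mathcal{B}} \mathbb{E}_{x \sim q}\big[ (\mathcal{A}_p f)(x) \big]
= \sup_{g \in \mathcal{A}_p \mathcal{B}} \big( \mathbb{E}_{q}[g] - \mathbb{E}_{p}[g] \big),
\qquad
(\mathcal{A}_p f)(x) = \nabla_x \log p(x)^\top f(x) + \nabla_x \cdot f(x).
```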

51:51

So for this family of functions we get a really good approximation error. But it doesn't mean that we are having a free lunch, because we could have functions outside of the family that perform worse than Monte Carlo. So what I think this method does is somehow prioritise the functions, and this happens for SVGD as well.

52:20

So as I mentioned, you can actually find a set of functions on which the SVGD algorithm is calculating exactly, with no error up to numerical error. But other functions may not be well approximated, so it's more like a prioritised space of functions. This is different from Monte Carlo methods, where you get the same approximation rate across all the functions.

52:52

Mm-hmm. I should say, someone is asking: do you have a good rule of thumb for choosing different kernels? For SVGD? We don't really have one; that's an open question. We do have lots of insights that haven't really been put together into an automatic procedure. I guess that's the way I'd frame it. So what happened was that in the beginning we didn't know what kernel to use. We know, for example, that if you use RBF, it's a universal kernel.

53:32

So it's OK, right? So it must work. So we have been using RBF in most of the applications, and it works reasonably well. And obviously other researchers have been proposing different kernel choices. But one thing that I always wanted to explore, but we haven't, is this quadrature view, which I think really gives a lot of insight on the choice of kernel. What's happening is that the kernel actually defines the space of functions which SVGD tries to exactly match.

54:13

Right? Just like quadrature is choosing to match the polynomial functions, SVGD is choosing to match a special family of functions that is defined by the kernel. But this dependency from the kernel to the functions that we exactly match is actually a complicated map. If we can somehow understand the map, and in fact numerically solve that map, then it will be very powerful.

54:45

Because, let's say we are interested in calculating the variance and not the mean, for example; then we can hopefully design the kernel such that the quadratic function is inside the function space that we are approximating, or at least close. If that happens, then we can get a really good approximation. In fact, you can see it's already happening.

55:11

For example, if the distribution p is a Gaussian distribution and we choose the kernel to be a linear kernel, which is k(x, x') = x^T x' + 1, then you can show that this function space is going to be exactly the set of first- and second-order polynomials. What that means is that you can exactly calculate the mean and the variance of a Gaussian if you use the linear kernel, right?
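
This can be checked numerically with a small sketch. The setup is an assumption for illustration: 1-D particles, a Gaussian target N(mu, sigma2), and the kernel k(x, x') = x * x' + 1; the function name, step size and iteration count are hypothetical.

```python
import numpy as np

def svgd_linear_kernel(x, mu, sigma2, step=0.05, n_iters=3000):
    """Run 1-D SVGD with k(x, x') = x * x' + 1 on the target N(mu, sigma2)."""
    x = x.copy()
    for _ in range(n_iters):
        score = -(x - mu) / sigma2          # gradient of log p at each particle
        k = np.outer(x, x) + 1.0            # k[j, i] = x_j * x_i + 1
        # The derivative of k(x_j, x_i) w.r.t. x_j is x_i, so its average over j is x_i.
        phi = (k * score[:, None]).mean(axis=0) + x
        x = x + step * phi
    return x

rng = np.random.default_rng(0)
particles = svgd_linear_kernel(rng.standard_normal(20), mu=1.0, sigma2=4.0)
sample_mean = particles.mean()
sample_var = particles.var()                # population variance (ddof=0)
```

At a fixed point the update vanishes at every particle. With this kernel the update is affine in x, so both its slope and its intercept must be zero, which forces the particle mean to equal mu and the particle variance to equal sigma2 regardless of the number of particles.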

55:47

So that actually explains why SVGD is sometimes really good at calculating the mean of a Gaussian. I think that's the reason. But the typical choice, using RBF, is not actually the right kernel for Gaussian distributions. That also explains why, for example, people often find that SVGD actually tends to underestimate the variance.

56:12

I think that's because the Gaussian RBF kernel is not the right kernel for Gaussian distributions. I actually didn't realise that. Right. Yeah. I don't know if there are more questions; I think we can finish right about now, but thanks a lot. This has been very interesting. Thank you.

Transcript source: Provided by creator in RSS feed
