On classification with small Bayes error and the max-margin classifier

00:00

Then to hear the approximately correct who is Professor Zoric, Sarah has previously held positions in Bristol, Utrecht, Toulouse and Leiden, where she received APHC in 1987, and she is an expert on empirical processes, having written the book Empirical in Estimation, which I use quite a lot during my master's thesis. Thank you for that and has worked recently on lots of things relating to high dimensional data. So we're very excited to hear about what she's going to talk about today.

00:41

So for questions, if you want to ask a question, there are two ways you can do it. So the first is to raise your hand and then I will interrupt Sara at a convenient point. And you can ask you and then you can ask a question or if you prefer, you can type your question in the chat and then you won't appear on the recording. And I can read the question out to Sara for you. OK, excellent. All right, take it away, sir. Thank you.

01:17

Thank you very much for this nice introduction and thank you very much for inviting me. So let me start screen sharing. This one. It's already full screen, right? Yeah, good. So so in my thought, what I will address is the following in many algorithms nowadays, machine learning algorithms, for instance, they report zero training error, but still a good. Just error. And so I want to somehow contribute to an understanding of this phenomenon and I will look at classification problems.

02:08

And in classification problems, if you have a small test error, it must mean that the base error is small because that's the unavoidable error that you always make if you believe in a model. So the test error can only be small if the base error is small. So I'll look at that context. And then look at the algorithms with zero training error, actually just three algorithms and zero training error means that they interpolate the data right.

02:43

And if you interpolate the data, that's only possible if you have enough flexibility to do that. So that's what I will look at, the situation, which is high dimensional, so that you have like a regression model with very many variables. And then the main issue will be the other boost algorithm, which you might know, which has been around for a while from Shapir, Point and Shapir, and try to explain its success a little bit using this Max Marxian approximation.

03:21

OK, so this is the title again, and then this is joint work with Jeffrey Chenault, Felix Meister and Matthias LeFleur, and these are so Geoffrey and Matthias are both stocks at our institute. And Felix is a Ph.D. student. And so I, I work together with them and they're a great team. And so much or maybe everything of what I'm going to tell this is they're hard to work.

03:50

OK. So here's a little bit of the overview, so I will look at the algorithms with zero training error, that means that the interpolated data and I'm going to concentrate on the special case where they the final INTERPOLATE is the one with the smallest L1 norm, because you can interpolate data in very many ways. You can also look at interpolation, looking at the smallest L2 norm, but that's a completely different situation and.

04:25

So my view will be more like the sparsity view where you should sparsity. And if you do that, you want to, of course, see how how it relates to regularisation instead of interpellation. So if you're high dimensional model, you either interpolate or not, and then you do regularisation that you introduce some kind of penalty term or regularisation term. And I'll just look at that. And it's kind of easy for that case to prove.

04:56

Some bounce and they will form the benchmark for the interpolating case, and then I'll show you some rate of convergence, which I can then compare with this benchmark. So here's the setup. So it's going to be a theoretical work where I have given design matrix X with and rows and columns. So NSA, the number of observation bees, the number of variables, and B, the number of variables is assumed to be much larger than and and so that's the high dimensional component.

05:41

Then you can think of a regression model with an unknown vector of coefficients, which actually will look at classification. And then the the the the the scale is not identified. So there you can assume without loss of generality that the vector of coefficients has L2 naum equal to one. That's just a normalisation assumption. And the sparsity assumption is that the effect of coefficients is only a few non-zero entries and the nonzero entries, if I count them, I have this sense for a sparsity.

06:22

Then there's a noise factor, so the noise is normalised in two and I call Sigma the the noise level and that's the quantity that's sparse. So in this setup, it's more or less also like engineering. The noise is not necessarily independent of the X. It's going to be adversarial noise or so it can be also a deterministic. Quantity, but you may think of this and it will be as a first step if you're statistician's, that's the way to think of it.

06:52

You may think of this as just a random noise. And then this segment is here. OK, and then dispersed the issues to be small in the usual sense that so say like s the fastest a number of active variables and it should be smaller than the number of observations. And then there is a long term. In terms of the number of variables to be because you don't know a priori which coefficients are non-zero.

07:27

Oh, so here's the notation that infinity norm with an infinity here, the L2 norm, the L1 norm is just some of the absolute values and the L. Zero is not a norm. But that's counting the number of entries, which are not zero. OK. So these are the two programmes I want to look at. So you observe this design matrix and you have this regression function. So let's say it looks like this. So just a regression function plus noise, but you only observe the same.

08:06

So you don't observe the regression function itself, and that's like in the classification problem, so the response variable is binary plus or minus one. OK, and. The interpolating which is developed by Blunt and vaccinia, so I called it upon and for sheening estimate, there is the one which says, OK, I'm going to interpolate the science, which I observed.

08:34

And I used to play there with smallest one norm. And that's like one bit compressed sensing, except that we're looking at a case where there is noise as so compressed sensing is where you observe X Beta Star and you want to recover the coefficients. And one bit compressed sensing is that you even further compress the real value, random variables or whatever real value variables by a binary variable.

09:03

So then it's only one bit. So it's a compression motivation, but you can also think of it as just the classification problem. Now, in this case, the the if you minimise the elbow nerve because of course, the skill is not identified. We now have to somehow use. And normalisation, and that's the empirical norm. Of the. Beta, so that's the vector X, that's a real vector column vector, so that's equal to one.

09:40

OK, so this is a factor in dimensional space x beta stars and in dimensional vector, and these are the entries of the 10 dimensional vector. OK, and it makes margin is the same situation. You observe what X and the sign of the regression. And I look at the margin, which is why to be. And I look at the minimal value and I want that to be maximal, overall factors be with L1 normalised and or equal to one.

10:15

So that's, again, a interpolating designs because you certainly wouldn't have that design y in the sign of the XP, that will be the same. Because you want to the minimal value to be maximal, so you will certainly have that's why times be will be positive. So those are two interpolated. And then we want to obtain bounce.

10:43

In to so the true in later, so sorry, the true fact of efficiencies is to naum equal to one because of the normalisation and in my estimate I also normalised to have Delta Norm equal to one. And then I want to find out how close are they. OK, and I'm going to look really at this compressed sensing type of situation where the design matrix consists of used in the standard Gaussian random variables, which is maybe a bit of a disappointing assumption, but at least it's a starting point.

11:20

And it's going to be like, I don't know how to generalise it yet. And this is the notation, so in probability of order at most, just not to get too complicated for us. OK. OK, so I'm going to look at Syria and there is some theory about one we compressed sensing this plant in Virginia estimated that for the next March, I'm not aware of any theory and it is related to the other boost algorithm. So we get our aim is to get x ray results for add a boost via this march in the Interpol later.

12:07

So let's look at some basic facts, which I'm just showing you, because I think it's kind of nice. So if you look at Vector's with Norm equal to one. And well, then the expectation of the product of their science, so x bita is just a linear combination of Gorshin set and X D also a linear combination of Gaussians and the expectation of having the same sine is the arc sign of the inner product.

12:39

It's called to the heart of the identity. So the probability that the signs are equal, you can think of this as the probability of making a misclassification is to Arko co-signers of the inner product between the two factors.

12:57

And that's not hard to see if you think little bit so you can sort of think of it as a two dimensional situation, although we look at higher dimensions where you have two vectors and they are Naum one, so they lie on the sphere and the Jodeci distance is just this distance along the sphere. And that's the probability of making a misclassification error for those two factors.

13:28

And then one over by here, OK? So what I'm going to do now is just say, OK, for the classification problem, you just have like for regression problem, some splitting up into the unavoidable error sigma squared and the error due to the estimation of the unknown coefficients. So let's do that. So you have this inner product. Between two vectors, if they have Naum equal to one, it's one minus the distance squared between the two vectors, that that's quite obvious.

14:06

And then there's one half year. OK, so the inner product is related to one minus the. Distance squared in this way, so if you look at the probability of a mass misclassification, it's just this are cosine of the inner product, which is one minus the distance squared. And then you just see that if you do a one term data expansion, that the distance is more or less the L2 distance. If they are close together, which is close, which is clear from this picture.

14:40

So did you distance distances along the curve at the straight line is not much different. OK. Now, suppose you have so again, just for one observation, say this is one observation from the test set, and so you have the sign of the regression and you have the noise with some variance.

15:04

Then the base terrorist probability that the sign is of the noisy regression is not the sign of the regression without noise, and then you use these whole identity and you see it's to sign cosine of this quantity, which is by doing just one term data expansion for approximately equal to two sigma, the noise level. OK. So, OUP. So the probability of making a misclassification error if you use B as your classifier.

15:45

Is just our cosine functions, which you can then split up in a term involving the unavoidable base error sigma squared and a term due to the estimation, so the due to making an error in estimating beta star. So this is exactly like a regression to a regression. Also, you have the excess risk, which is this Delta error. And then Sigma squared, so that excess risk is here in Sigma squared is an unavoidable error. OK, now I want to show a result for the vectorised estimated and.

16:32

So I need some identities which are very straightforward. First of all, if you look at the expectation of the inner product between the sign and this linear combination, it's again, in a product of the truby to start and be.

16:50

And then there's some coefficient in front of it, so if I put here and the difference between the Vector B, so B is an arbitrary factor here and the difference between the vector B and the vector B to start a true vector, of course, B, to start in the end product, which itself is one. So you get this expression, which is, again, the one minus to in product, which is again the end to the L to distance squared.

17:23

So if you want to estimate the effect of coefficients, be the star, then this can make a good loss function because this is clearly minimised or a risk function. This is clearly minimised at the equal to Beat Star. So what you do to estimate B to start is just to replace this expectation by an average and minimise that. OK, and then you do a regularisation to make sure that you can deal with this high dimensional situation.

17:57

So we do that, so there's a minus sign missing here, so instead of minimisation, we do a maximisation maximisation of the margin and then minus the regularisation term, the L1 norm of the vector of coefficients. And you want to still be one, so this is just like the lasso, that is just the lasso. If you observe the regression function, you can do the lasso, just the least squares and have been analyzation on the elbow norm.

18:30

That's the lasso. And this is not much different. OK. OK, now let's look at this march and so the average margin and its expectation, this is a factor of lengths B, so this is a very long vector and I subtract its expectation and then they take the maximum. So I take the maximum of P random variables. They are independent. So you can well. Sort of show. That it behaves like squared, look over and just like a maximum of Galicians.

19:15

And so that is these are more or less, gosh, Anthony, a few of them and a maximum logarithmic. Inbee And there's a square root. There's an end because of the normalisation with one over in here. OK. So if you do this regularised, estimated to be dieties to regularised, estimated. It's lost or so the Alta convergence rate is here, it's more or less. Squat and look, be over in times to L1, normal feet to start and then the noise place to roll here it's one plus sigma squared.

19:57

And so this is my benchmark for the regularised case. And just so that you see that it's not very deep. Let me look at the proof. So the estimate is defined by maximising this quantity. So it will be larger than the same expression at the true victor to start, right? So I don't just put minus sign in front so that you sign switches. And I put the the penalties together and there comes somebody visiting me, the cats. OK, um, so this is just rewriting that you're maximising something.

20:43

And then I subtract and add expectations, let's see. So this is the sort of the theoretical risk. And we just saw it's equal to this expectation. And then I replaced the expectation by an average. And then I have to correct for that. And here I can use the dual norm inequality's, so the inner product between two vectors can be bounded by the Infinity Normington one nor. Then you get this laptop had this this infinity norm, and this is the one norm Bita had minus Beta Star.

21:26

And they're left over with the term which was already there, and then we used this arc inequality. Which we just saw on the previous page. And then we used the triangle inequality. Yeah. So there's nothing going on. OK, yes, Judith, you still have a question. I just I wasn't sure what the hotkeys which you said it afterwards, so he just looked at it again. It's a good point. It goes quickly, leapt ahead. It's like the maximum of Pete, more or less Gaussian random variable.

22:05

So that's a random variable, which you easily have some kind of concentration inequality for it. So with large probability that will be of order squared. Look, be over. And just like for the last of your usual thing. Yeah. So that means here this is the inequality, if you take your tuning parameter, not too small, but also not too large. It gives you about a high probability. Yeah, OK. So it looks very much like the lassalle. You can.

22:50

We find that if you wish to execute sparsity, so suppose that are only a few non-zero coefficients in the true vector and the number of them is littleness. Then the rate of Convergys becomes Lakdar Square times and then something with the noise. And that is, again, of the same order, so that means you'll have to square it is look be over and in order. So you get something like the number of active variables divided by the number of observations a long term and one plus sigma squared.

23:27

OK, so the proof is, again, very simple, but I don't think I have that much time. But this is the proof. It's really just being a little bit more careful with the triangle inequality, that's all.

23:43

So this is the benchmark, so the MySpace classification error of these regularised estimates consists of two terms to unavoidable based risk and the term due to estimating estimating beta star, so to excess risk and the excess risk is squared lock B over N times to Evernham if the true vector is only a one spa's. And it can be improved to be overturned if the true vector is zero Spar's, which means that it has only s non-zero coefficients.

24:25

And you see what is happening here. If there is no noise. The if there's no noise. It's the area is still not zero. Which is unlike for the last, so and that's because you only observe the sun, you don't observe the complete regression function, so there's a price to pay. You cannot really exactly recover. B Star. And it's I'm not sure if there are lower bouncier, but. It's quite clear that you cannot exactly recover it, and so they're still an error, even if you don't have noice.

25:05

And if the noise is small. So you have the unavoidable noise here. And if the noise is small, this is all in order, so if the noise is small one plus the noise squared is of the same order as just one. So you can sort of cancel this term. And so the noise doesn't really hurt in a sense, if you look at the order, at least if it's small and that's the case, we're interested in small noise because otherwise we'll never have small test error.

25:36

OK. So that's my benchmark. So now let's compare that with the interpellation, so the plan and for sheening interpolating and Dymocks merchant later. And we will use them to study them results from bases for shoot, which is going to be noisy bases for shoot. So what's Noisey basis for a suit in basis, noisey basis, you do observe the regression, so you do observe this regression, the real value response variable. So you just have a regression model and then you interpolate. The data.

26:22

So see you interpolate the data and you take the interpolating, which is L1 naum minimal. OK, then there are theoretical results in this paper by cheek and also in more recent work. And so what we derived is the same as this paper from 2010, but our processes are quite different. So this this is more like a real. Yeah. Like from a different community, I would say machine learning or computer science. And our language is more statistical. So that's.

27:03

A noisy base, for sure. And then later afterwards, we will consider the case where you only observe the sign of the regression function and we look at it later from Blunt and for Sheenan and we look at the Marks,

27:17

Marks and Interpolate, OK. So that's to come in all three cases, I will need a good bound for the elbow norm of the estimate, and that's kind of clear because once you have a good bound for the elbow norm, you can use results from empirical process theory of empirical processes with coefficients bounded in L1, and that gives you about four entropies and so on. So then you're in business without explain it in more detail later also.

27:52

OK, so let's look first at bases where you do observe the regression function with Noice. And you interpolated, you choose to one with smallest one or OK. So what is a good bound for the estimate there? Maybe I should go back. And so the point here is if you have noisy and. Observations, the true beta star is not interpolating quite clearly because of the noise, so somehow we want to interpolate the noise to make sure that your interpolated data.

28:36

And this is more or less what what the result says, if you want a good pound for the L1 norm of the estimated, you can be counted by the L1 norm of the truth in a term which is the noise level, Times Square and divide by lockpick. And the reason for this is sort of follows from the chaotic conditions or maybe from the from the dual formulation, and if you do that, I won't give the details.

29:08

It says that the 11 norm of the estimate is bounded by the L1 norm of the truth and then a term involving the L2 norm of the noise, divided by something and divided by something. You want to divide it by something large so that it will be small. And if something is the maximum. Of Gaussian random variables.

29:32

But. OK, these linear combinations each time for different love, that you have a different number of Gaussian random variables and then you minimise again, overlap the supposed the minimum of a maximum. And there comes the assumption that the sample size should be small compared to the number of variables, because we want this to be large. And if you look at extreme value theory, if you have to be so big, independent Gaussian random variables, you have this extreme value limiting distribution.

30:12

And I think it's a Google limiting distribution. So we know this maximum. The that's like Squirrelled Lockerby. But it's a little bit more involved because we have, again, an information here, so this is a lot of technical work and we want it to be non asymptotic. So we get that this information behaves like indeed like square root Lockerby. If the sample size is small compared to be, for instance, the sample size is smaller than be divided by Lockerby, something like that.

30:48

So really the high dimensionality is your playing is crucial here. Otherwise we won't have the lock be here. So this term is the square root, and this year and the lock is here, square root and times super. So if you didn't have about if you didn't have this bound for the one norm. You can use Sambuca process theory to get bounced for quadratic forms where it is Elgon norm of beer.

31:19

So you just take the empirical average minus the theoretical average, which is one of the to go fishing normalised to one. And you look at these quadratic forms, how different are different from the expectation? And then the L1 norm plays a role. And we just showed that this L1 norm behaves like square root and divided by Lockerby. So that's exactly nice cancelling. So that gives you that these pretty radical forms are dist. like like this Sigma's stick.

31:53

My head is occurring there. So that gives finally this result that the estimation error of the noisy basis pursuit estimate is bounded by C times. Some universal Constantines stick my head and of course, if there's no noise, then it means that it's zero. So that's in basic you have in the nonlawyers you you have exact recovery. But the noisy Ferreti you have now. Some. Some error still, but if the noise is small, it's still reasonable. So now let's look at one bit compressed sensing.

32:40

So now we only observe the sign of the regression function, we minimise to Eleanor, as in the basis for shoot. With an interpellation and with a normalisation. And then the question is, again, what is a good amount for the estimated? And again, you see the true fekter with star is not interpolating because of the noise. So, again, we have to interpolate the noise. And if we do that. So the idea is so if you if you put your to true factor for B plus something which interpolated noise.

33:24

Then you have a candidate because then you have an interview later. And so do the elbow normal of this candidate will be less than the elbow will be larger than this one which minimises it. So don't we have about four to elbow normal of the minimise her? If I have canidate here, which I can control, then this one we will have elbow normal smaller than this canidate, ok. OK, so why not interpolate the noise, so I interpolate here to noise sized noise and I think we had the Yeah.

34:06

The L1 minimise or which relates to noise and I just dig here. So I know from the previous result from bases pursuit, I have this elbow norm for this interplay there under control. And we're over if I added to be to start a true victor, I have an interview later with the right sign. So that's a candidate, so I know that one of the estimates responded by one normalities candidate. Oh, yeah, I have to normalise it, but that doesn't really hurt very much.

34:42

So then I'm I'm done. Then I have about four to one. And so that's the idea. So I have my true Victor, I have my noice. I interpolated with some other factor. And then X be the star plus noise is the same as this by definition because I interpolated it. So I have here a vector with the correct which gives the correct sign, which this one interpolate. The signs. So the one normal beat, the starters, beat the head will be larger than of my. Planning for Sheenan estimate, which is to minimise.

35:29

So then again, you have about four to one norm, and I'm not going to show all the technical details, but then you can do empirical bounce off. The premium of a process is indexed by functions with coefficients with L1 normal, bounded by some quantity. And then we get this result, so the planning for Sheenan estimated this interpolating. As in L to the if you have a sparsity, the rate is like before Uzelac be over in.

36:06

And now if you look at the benchmark, it was times one plus sigma and now it's just plus Sigma plus genoise. And for the if you have only one sparsity, you have again, it's exactly like the benchmark estimate. But instead of multiplying by one plus sigma, you add the noise sigma. So it means that if noise is some. Is a large. You say small but large, still larger than this term, you still feel it so in the benchmark estimate. If the noise is small, you hardly feel feel the noise.

36:52

You only you already have base noise. And that's the main part. The here it can be. That noise is larger than the area due to estimation. So. So it's not a less beautiful result, which is, of course, clear because you do interpellation instead of regularisation, so it's not not as good. OK. So how much time do I have? What do you say about ten minutes, five minutes? OK. I'll skip that, so now go to Adam. So we observe as before X and the design and adapt algorithm is as follows.

37:44

You first take a vector of weights which are on the simplex. Well, added boost is for weak learners, they are defined differently, so to mimic that, we normalise our design so that it all all design entries are between minus one and one. You start with an ization and then you go in a certain direction, don't ask me too much, but this is just to see it quickly. You go in a certain direction and then you define a step size, which is here, the one inspired by explanation plus.

38:28

And then you update your your vector of coefficients with a learning rate, Eita. And then you update the weights and you keep on doing this until stopping time, so capital T, OK? And then there are results which say that if the learning rate goes to zero and the number of iterations goes to infinity, the outermost algorithm converges to the max margin classifier, which is maximising the minimal margin.

39:09

And that's up to skating, the minimum one norm interpolate there, that is you look at the the minimal use to make sure that the margin is always bigger than one. And then you look for a solution with L1 or minimal. So there are some theoretical results for the relation between the boost in this max margin classifier. When there is no surveillance is of the same order as the number of variables. So that's not our case.

39:43

We have that a number of variables is much larger than the number of observations, and that's really a crucial assumption in our theory. So let's look at this next March, which is the same as this minimum, one norm estimated. And what is a good bound for four to one normal of these estimates? Now, if you look at this carefully, you would say, OK, does it somehow mimic the behaviour of the true vector beta star?

40:14

So if you insert to your beat a star the true factor. And look, think, think of the case Western, where there is no noise to simplify, if you insert here to start, this is just the absolute value of that factor. So this is just the absolute value of of a Gaussian random variable. And then you take the minimum, the minimum of any standard Gaussian random variables. And a minimum of standard Goutman Veritas behaves like one over, and so that's very, very small.

40:52

And you want it to be bigger than one Sabetha star is no way a good candidate for solving this problem. It's completely out of out of scope. Of course, you can be skilled, but then this one, Norm, is going to be very large. So somehow this is a very strange thing that you doing something where you think, well, be the star itself is not a good solution to this problem.

41:17

So why would this be a good estimate or. OK. Um, yeah, maybe if you have time, you can write down, so again, we want about four to one norm of the estimates. You can again, like for basis pursuit, do the dual normal for immunisation and then we get a lower about. So I won't go into details there. But let's look at the upper bound and just briefly explain the idea that, again, what we use is this baseless pursuit idea. And so let me just show you the picture.

42:00

So the true Victor Batista. Is not bigger than one. Right. It's going to be orobets very close to zero at certain points. So it's not a good candidate for this minimisation problem, but what we can do, we can say, well, there is some error. We just look at is this linear function and we truncated at some positive fancy epsilon. OK, then, it's nice away from zero, so now the margin is Epsilon's already something, but then we make a mistake.

42:37

And this mistake, we just consider this error. So don't worry about noise, because we can do that as before. So if but if there's no noise, we have this additional error and this additional error, we just interpolate. XBLA Hetson interpolating interpolating this additional error. So if I take that one. And I subtracted from the star, I get a nice function, which is. Staying away from zero, so there's a good margin and it has to right some.

43:13

So that's the idea. So, again, interpolating some kind of error will help me to fight about four to one Naumov. The estimate, and it turns out to be like this after some some work because you have to renormalise and what occurs is a funny power to power two thirds here.

43:33

But this one, we we sort of know this should look familiar, this looks like the benchmark, except that there are noises at a different place and it's very much also looks like this expression we had for the wrong end for Sheenan estimated. With nonracist, funny power, two thirds. OK. Anyway, we then you have to do some work to look at, not at all, one normed at the normalised factor be to head the L1 normative normalised vector data had to divide it by two L2 norm.

44:09

And for that we use, you know, some some some further empirical process theory. And then we end up with this result, so this is exactly the same as for the plant in Virginia estimated, except that it's the slow rate. And if you have a sparsity, we were unable to to to explore that or exploit it. So we don't have to fast track. We only have the slow rate with about one fourth.

44:43

And it's this one for us, it's like for the benchmark estimate, but the noise is at a different place, so maybe I should look at the table here. So here's the benchmark estimate. And unavoidable base error. And you see that the expressions are similar except for the benchmark estimate. If the noise is small, you can forget about it as far as the excess risk is concerned. For the plant in Virginia, an estimate and for the march, an estimated anois will play its role and.

45:21

We have blonde and forseen and mixed and same result. Well, there's a Lockton yeah, I don't know, of course look there, but we can't prove fast rates. We're unable to boost fast rates for the next March. The estimated. OK, and then, well, just just to conclude, also, we can we have months of asymptotic about on how far this makes margin estimated and is from the other boost output. So this can also be used. To conclude the rates of convergence for the Addabbo algorithm.

46:05

OK, then we worked harder, harder, harder, you know, what we started with was the result with your one twelfth. Now, we got it back to one first, but I think we are now in the situation when we get one third here. So it's probably this is the final answer to one third here. And what we can do so what you can do in these kind of problems is just first proof bound for the urban norm and then use all the results. There are very many results in literature on the on the line segmentation problem.

46:42

And you can use their results. And there's one paper where we can conclude this power, one third. But so far, we do not understand the proof because it's very geometrical. And the paper is not published, so we really should understand the proof, but I think it's correct. So that means the rates can still be improved. And we hope to have that on archives. OK, yeah, um, so conclusion.

47:13

So then either boost or mux merchant type estimated to have the same rate as the slow rate for one bit compressed sensing with linear programming. So the Bonhoeffer Sheenan estimated the confidence is less. It means the you have a very strong concentration for the first union estimates, but for the max merchant estimate, we don't have that because if you look at the minimum of random variable, that doesn't concentrate very, very strongly. And so lower bounds. That's an open question.

47:49

Which I'm not able to answer yet, and, yeah, we looked at some lower bounds, the lower bound for the elbow. Norm, they are the same as the upper bound. So there we are tight. And indeed, if we do the simulations, we see the power one third, which which I had here appearing. So it seems to be a. Something you also see, you can see in data. OK, well, that's that's it. Thank you very much for your attention.

48:22

Thank you. Thank you very much, sir. And do you feel free to use your virtual pause to acknowledge excellent talk? Thank you. Right now, they don't have any questions. Judith, thanks for the talk. I have two questions. One is that I was trying to figure out why people graters and was much more interesting.

48:52

Is it because you have any sort of assumption on The X Factor with the idea on a virus, on the virus, that you have some kind of smoothing effect that allows you to control everything for very large people? Well, it's just it's in the lock. So it's this this phenomenon here. So far, for the maximum of precaution and random variables, we need to lower about that base like squirrelled lockpick. And if you don't want to have high dimensional problems, then the lock will disappear.

49:28

That's all. But that looks good. Now, it's kind of nice to have your look here and it disappears with the lock here. And that's kind of pleasing. So but if you don't mind, locks in the high dimensionality can be of course, you need some high dimensionality because otherwise you can't interpolate clear. Yeah, well, with what we see in the simulations, it's really it's really kicking in the theory for high dimensions, that's still kind of interesting.

50:03

So apparently. Because it plays a role in all the other deprivations as well, and you see it in the simulations, if we don't let BP grow, then. It just. We get completely wrong curves. Yeah, so. Thank you. So did you have a second question? Yeah, I can't remember right now that was the first time we come back. Yes, I was going to ask so can you kind of check these up abounds empirically?

50:49

Like I mean, do they look like they're tight? So you've got this third, which you think is it is is the correct answer now. Yeah. So indeed. Well, we just finished a simulation because it was a very in many things to to to consider. So if you do add a boost, there's a some kind of standard defo value for the number of iterations and all that things. And you if you take the those values, you don't see it.

51:20

But if you do enough iterations, as the theory asks for with a small enough learning rate, then you see it perfectly. It's really incredible. So you see the one third rate, which is in the L2. Now to conversions, and you see the one third rate in the march and you see it's really perfectly, but of course at a certain point, you know, the ultra distance between two vectors, which both have naum equal to one, can never be bigger than two.

51:54

So there are some bounding off the curve, if at certain points, but I mean, if it's really quite. Seeable and for the 11 11 marchin, so we know that the upper bound in the lower round are the same up to Constance. And indeed, so you see that in the simulations, it's. It's really very nice. Yeah, yeah, yeah. Your question, come back to you, the theory of justice, and so it's a bit related to what we've been asking and maybe I just didn't understand what you were saying.

52:32

But so far, the other booster I'm always for following for your last line of your colour, of your books at the end of your table, at the end, you didn't get the results, the rate for the case where you have this false and not sparse seducing me, that it's a technical problem and you don't get the proof yet, but there is a chance that the rate is better. Is it a case where actually you can't do better? What would you do you imagine? I don't know. But in the in the simulations.

53:07

If there is noise, we see that this one we see the one third in the simulation, if there's no noise, we see the one half. So probably we missed some point there. Probably, you know, I do think that indeed, if there is no noise, it should be possible. It's it's just yeah. It's just that we cannot exactly bound the one norm of the estimated by to one or more of the troops, but there's a constant in front that's which is of order one. But still that bothers us. Yeah.

53:51

So it's a very good question. And then we keep on thinking, oh, yes, that should be possible with the. So far, we didn't succeed, so I gave this talk in the over woolfenden once and I was there tonight and he just shook his head. He said so probably. But that was the one 4th. So, no, I don't know. Shook his head at this. You think you thought it was a rubbish bound or you so you said it was a hopeless business. I don't know. I want to ask you why they didn't make it OK?

54:34

Well, yeah, no, it's I keep on working on that one because I don't know about lower bonuses or so I don't know. Mayor. You do have a question. Patrick. Yeah, thanks. Thank you, sir. Enjoyed the talk. Yeah, so I had a question on the connexion with other books as well, just because it wasn't clear from the slide what you're really doing with it. So I understood that you look at the limits of it one, but at that point you have exactly the maximum solution.

55:18

Yeah. So do you do you do something with at least stopping? Yes. At least we have some kind of theoretical result which says if you have like a log and something at least and the learning rate of descent, this order, the number of steps that in that order, then the margin coming out of this algorithm is getting that much larger or smaller than the margin coming out of the next margin algorithm.

55:53

So there are asymptotic results, but we sort of refine them so that you can you see the non asymptotic relation, but I didn't put it on the slide. That's right. And and so within this picture, you you are able to prove that if you do this type of early stopping, you achieve better bounds as a function of.

56:17

So we don't so the theory says you should not do something, you should do enough iterations, so you go on, you do your Arabised algorithm, then at certain point you start interpolating and you continue with your outermost algorithm, enough iterations. And then the theory says that you are close to this max march and solution.

56:43

We say exactly how we can give some bounce on how close you are then, and we see it also in the simulation, so if you do the Audibert algorithm with small step size learning rate and enough iterations, you see the groups are hardly you can hardly distinguish the curves. I'm sorry, these simulators are just done, so I didn't put them on the slides, but it's kind of strange if you do the utmost algorithm with large step size, it is giving much better amounts.

57:16

So it seems to be. It's even still interpolated with better classification error, so smaller classification error, this I don't know why this is what the simulations say. Yeah, I will. So I'm familiar with the literature on regression, and I believe it was Scootin William, but I they they have some paper connecting early, stopping with notion of M. Max optimality in that setting.

57:51

So yeah, I understand now you're focussing on interpellation and you want another figuration for this phenomena to kick in. Yeah. So you're you're doing the really going way beyond what is allowed for classical statistics. You don't do early shopping, you just go on with your iterations. Yeah. Yeah. And that's exactly what helping you should be seeing better and better rates as you do with your benchmark estimate.

58:24

That will be money. Well, there's one thing the rates can be better because of that, because you can do something similar with regularisation, but then you can't computed. That's the point. So theoretically, you can sort of minimise some sine function, but that's a hopeless, non convex problem, Zozo. You're thinking of knowing where to stop. No, if you were to.

58:52

Oh, no, no, I don't know if you know where to stop, then I would say this gives you a. It's a good question, so if you look at the benchmark, let me just go back. I would say, knowing if you do early stopping, that's like regularisation, so just out of some vague idea, I would say it's it would give you these rates. Yeah, which is still worse than than the interpolating rate, which kind of strange. And. In a sense. Yeah, and. So something to look at. Yeah, yeah.

59:43

Thank you. All right, thank you very much, sir. We put a stop to let's all applaud again. Excellent work. Thank you. Thanks a lot.

Transcript source: Provided by creator in RSS feed: download file

Episode description

Transcript