Neural Networks and Deep Kernel Shaping

Apr 05, 2022, 55 min

Episode description

Rapid training of deep neural networks without skip connections or normalization layers using Deep Kernel Shaping. Using an extended and formalized version of the Q/C map analysis of Poole et al. (2016), along with Neural Tangent Kernel theory, we identify the main pathologies present in deep networks that prevent them from training fast and generalizing to unseen data, and show how these can be avoided by carefully controlling the "shape" of the network's initialization-time kernel function. We then develop a method called Deep Kernel Shaping (DKS), which accomplishes this using a combination of precise parameter initialization, activation function transformations, and small architectural tweaks, all of which preserve the model class. In our experiments we show that DKS enables SGD training of residual networks without normalization layers on ImageNet and CIFAR-10 classification tasks at speeds comparable to standard ResNetV2 and Wide-ResNet models, with only a small decrease in generalization performance. And when using K-FAC as the optimizer, we achieve similar results for networks without skip connections. Our results apply for a large variety of activation functions, including those which traditionally perform very badly, such as the logistic sigmoid. In addition to DKS, we contribute a detailed analysis of skip connections, normalization layers, special activation functions like ReLU and SELU, and various initialization schemes, explaining their effectiveness as alternative (and ultimately incomplete) ways of "shaping" the network's initialization-time kernel.

Transcript

So, hi everyone, and welcome to this week's [seminar name unclear]. Today our speaker is James Martens from DeepMind. James is a research scientist at DeepMind working on deep learning fundamentals, training algorithms and theory. Before that, he got his Ph.D. from the University of Toronto under the supervision of Geoff Hinton and Rich Zemel. So James, go ahead. Thank you.

Yes, today I'll be talking about a recent project called Rapid Training of Deep Neural Networks without Normalisation Layers or Skip Connections. And you see my great collaborators down here; most of them, actually all of them, are at Alphabet, some at DeepMind and some at Brain. So deep neural networks have become ubiquitous in modern machine learning applications.

You see them in reinforcement learning agents, translation systems and language systems, vision systems, speech recognition systems, and in search and recommendation. They're pretty much all over the place now. Now, while practitioners of neural networks have come up with many heuristic innovations that make them train at higher depths, and which are very useful in practice, theory hasn't had much to say about this.

It's been quite slow to catch up, and rarely do you actually see theoretical insights making an impact on the practical application of neural nets in these contexts. So currently, deep nets seem to require some combination of the following elements to train fast: normalisation layers, such as batch normalisation or layer normalisation; skip connections, also known as shortcut connections; and specific choices of activation function, such as ReLU and SELU.

Now this comes with various problems. First of all, the mechanism of action of all these elements is not particularly well understood. There has been progress in this direction, but I don't think we're anywhere close to a complete picture yet. It's unclear also how to use these elements in new architectures, partly because we don't understand fully how they work.

So, you know, if you ask the average practitioner, "Why don't you just put normalisation layers between the blocks of a ResNet, maybe? What harm could it do?" It actually does a lot of harm, but it's not at all obvious why. And it's actually this very particular recipe that is used in ResNets that's surprisingly effective, for reasons that have sort of nothing particularly to do with the individual elements,

but actually just the combination, the very particular way that they're combined in that architecture. Batch norm in particular has caused problems in certain domains, where the information sharing that you have over the mini-batch leads to degenerate training. You see this in certain kinds of generative models and self-supervised models. Also, skip connections will change the inductive bias of the model.

This may or may not be desirable depending on your application, but it's kind of annoying that you have to include them if you want your network to train at all. I would say, more speculatively, these techniques might be acting as a crutch, and our reliance on them could be holding us back from pushing the practice and theory of deep learning to the next level.

In particular, if we don't understand why they work, there's really no way that we can push the state of the art beyond just random exploration. So in this work, we develop a method called Deep Kernel Shaping, or DKS, which is a general automated framework for transforming neural nets so that they have better properties at initialisation. And this will make them easier to train.

So the headline result is that DKS enables rapid training of neural networks that are traditionally considered hard or impossible to train. This includes very deep vanilla convolutional networks, where by vanilla I mean without batch norm or skip connections, and networks with unpopular activation functions such as tanh or softplus. And this work also sort of reveals why those choices are unpopular at all.

And, you know, we'd like to speculate that this approach will be very useful in developing new models, because it removes some of the requirement for particular architectural features in order to enable fast training. In the work, we also provide a comprehensive explanation for why things like ReLUs, batch norm layers and skip connections speed up training, and we show how DKS makes them at least partially unnecessary. You can see the paper for that.

So in general, DKS supports fully connected, convolutional, pooling, weighted sum, layer norm and element-wise nonlinear layers, although the element-wise nonlinear layers always have to be preceded by a linear layer, that is, a fully connected or convolutional layer. That's a requirement. And it supports Gaussian fan-in initialisation, which is standard, and also orthogonal initialisations.

Strictly speaking, for convolutional layers these have to be of the Delta type, which means that you basically just zero everything but the centre part of the filter, although in practice this approach seems to work OK without the Delta init. But it is a formal requirement, at least for the theory. And then we also assume that the biases are initialised to zero. Most types of weight sharing are actually supported by our approach, such as what you see across the time steps of RNNs.

And this is owing to some recent work by Greg Yang showing that a lot of the mathematical tools we use are actually applicable to networks that have weight sharing. Previously, that was not well understood. And actually, the current draft doesn't even reflect that; it says that it doesn't work for RNNs. The approach also supports arbitrary topologies, such as multiple branches or heads. And it also supports networks that have skip connections.

And so for this talk, we're going to simplify things a bit, just for the sake of clarity. We're going to assume only fully connected layers and element-wise nonlinear layers, and a very simple feedforward topology, so basically your standard MLP. And we're going to assume that our network inputs are normalised to have a norm equal to the square root of the dimension, which is a pretty standard thing to assume.

So the mathematical basis of the method comes from the theory of kernel functions for deep networks, or kernel approximations. These are approximations that apply when the network is randomly initialised. So in particular, let f be a neural network function that gives a vector output given an input x. And here we can just take the output to be, you know, before the logit layer,

so when the layer is still pretty wide and doesn't depend on the target output dimension. So at initialisation, it turns out that you can approximate, uh... Can you also see my mouse cursor? I'm not sure if it comes through. Do you see that? Yeah.

OK, that's good. Um, yeah, so you can approximate both this squared norm divided by the dimension, and the inner product between the outputs of the network for two different inputs x and x prime, normalised by their respective norms. You can approximate these using only knowledge of the network structure and the following three scalar quantities, which are just the squared norms, divided by the dimension, for the two different inputs x and x prime, and also their inner product divided by the product of their norms.

And by the way, this quantity is just a cosine similarity, if you're familiar with that, as is this quantity. So yes, you can do this computation, and we'll call these dimension-normalised squared norms q values and these cosine similarity quantities c values. And we'll say that they're computed by functions called Q maps and C maps. The Q map takes the input q value and gives you the output q value, or at least a good approximation of it, and a similar statement holds for C maps and c values.

There is a subtlety here [audio unclear], which I've sort of swept under the rug, but it turns out that with the data preprocessing we use, you can do that. And I should say this approximation gets better as the width of the layers grows. So, yeah, this is your standard deep kernel approximation, sort of boiled down to its essence. Having stated what these things are in terms of what they approximate,

I still haven't described, you know, how you actually define them and how you compute them. So to compute a Q map for a fully connected layer, you just set it to be the identity function. So that's trivial. For a nonlinear layer with activation function phi, the Q map is given by this formula. It's just a one-dimensional Gaussian expectation, as you have here. The C map is a bit more complicated; it's a two-dimensional Gaussian expectation.

It's not so important to actually read these formulas for this talk. Really, the only takeaway that you need is that we can actually compute these, if not in closed form, then at least by numerical integration. And it's pretty efficient, because these are only one- and two-dimensional Gaussian integrals, so you can compute them reasonably fast, up to very high precision, for arbitrary activation functions phi. And it also turns out that you can compute their derivatives as well, which will be important.
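As a rough illustration of the numerical integration mentioned here, the following sketch estimates the Q map and C map of a single nonlinear layer with Gauss-Hermite quadrature. The formulas in the docstrings are the usual Gaussian-expectation forms and are an assumption about what is on the slide; the names `q_map`, `c_map` and `scaled_relu`, the 80-node rule, and the choice of a sqrt(2)-scaled ReLU are illustrative, not the authors' code.

```python
import numpy as np

# Gauss-Hermite nodes/weights for the weight exp(-x^2 / 2); dividing by
# sqrt(2*pi) turns weighted sums into expectations under N(0, 1).
_NODES, _WEIGHTS = np.polynomial.hermite_e.hermegauss(80)
_WEIGHTS = _WEIGHTS / np.sqrt(2.0 * np.pi)


def q_map(phi, q):
    """Assumed form: Q(q) = E[phi(sqrt(q) * z)^2] with z ~ N(0, 1)."""
    return float(np.sum(_WEIGHTS * phi(np.sqrt(q) * _NODES) ** 2))


def c_map(phi, c, q=1.0):
    """Assumed form: C(c) = E[phi(u1) * phi(u2)] / Q(q), where u1 and u2 are
    jointly Gaussian with variance q and correlation c."""
    z1, z2 = np.meshgrid(_NODES, _NODES, indexing="ij")
    w = np.outer(_WEIGHTS, _WEIGHTS)
    u1 = np.sqrt(q) * z1
    u2 = np.sqrt(q) * (c * z1 + np.sqrt(max(0.0, 1.0 - c * c)) * z2)
    return float(np.sum(w * phi(u1) * phi(u2)) / q_map(phi, q))


def scaled_relu(x):
    # ReLU rescaled by sqrt(2) so that an input q value of 1 maps to 1.
    return np.sqrt(2.0) * np.maximum(x, 0.0)


print(q_map(scaled_relu, 1.0))   # output q value for input q value 1
print(c_map(scaled_relu, 0.0))   # output c value for orthogonal inputs
print(c_map(np.tanh, 0.5))       # a smooth activation works the same way
```

Quadrature converges quickly for smooth activations such as tanh; for kinked ones such as ReLU it is less accurate, so a closed form is preferable where one is known.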

So having defined Q and C maps for individual layers in our network, we can define them for entire networks. And we do that by simple composition. In particular, the Q map for a composition of two networks f and h just ends up being the composition of the Q maps for the individual f and h, and similarly for C maps. By the way, if any of this is unclear, please let me know now, because the rest of the talk relies very heavily on this.

And so if this is not clear, you're going to have a hard time following the rest of the talk, so please let me know if you have any questions. It should be noted that Q and C maps are only valid descriptions of the network at initialisation. That's very important to underline. In general, for a trained network, you can't actually predict the output norm given the input norm this way. It just doesn't work. All right. So having defined Q maps and C maps, we can start to examine them for deep networks.

So just to recall, the C map essentially determines the angle, because it's the cosine similarity (or the distance, because you can derive the distance from the angle if you know the norms), between two output vectors from the network as a function of the angle between the input vectors. And in deep networks, we see that C maps become degenerate, so that information about the input angles is essentially obscured. In other words, it's hard to recover. So you can see that in this plot here.

These are C maps for deep ReLU networks at different depths. For a shallow network, it's a pretty reasonable function. You could have an easy time inverting this to recover the input c value from the output c value. But as the network gets deeper, the map gets flatter and flatter. While it's still technically invertible, it will be very hard to invert it in practice under any kind of approximation noise.

And of course, because these maps are only approximate descriptions of the network's behaviour, that will be a concern. So essentially, what's going on here is that, if I give you an output c value, the dependence on the input c value will be so weak that it'll just be swamped by the noise. So you've really just lost that information about what the input c value was. In other words, you've lost information about the input distances in the output space.
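The flattening described above can be reproduced in a few lines. This sketch assumes a plain ReLU MLP with zero biases and q values equal to one, and uses the well-known closed form for the ReLU C map (the degree-one arc-cosine kernel) rather than anything taken from the slides; the function names are made up for illustration.

```python
import numpy as np


def relu_c_map(c):
    """C map of one ReLU layer (assumes zero biases and q values of one)."""
    c = np.clip(c, -1.0, 1.0)
    return (np.sqrt(1.0 - c * c) + c * (np.pi - np.arccos(c))) / np.pi


def relu_network_c_map(c, depth):
    """C map of a ReLU MLP: the single-layer map composed `depth` times."""
    for _ in range(depth):
        c = relu_c_map(c)
    return c


inputs = np.array([-1.0, -0.5, 0.0, 0.5, 0.9])
for depth in (1, 5, 20, 100):
    # Each row shows the output c values for the same five input c values.
    print(depth, np.round(relu_network_c_map(inputs, depth), 4))
```

As the depth grows, the printed rows bunch up just below one, so very different input c values become nearly indistinguishable at the output.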

Now, it's not obvious that that's going to be a problem, but it turns out that it is. It turns out that this situation sort of dooms gradient descent learning. So in general, a degenerate C map is one that squashes the entire range of input c values around some output value,

c zero. And there are two basic cases for the value of c zero, both of which turn out to be bad for training. In the first case, the value that you squash to is significantly less than one, one being the maximum possible value because these are cosine similarities. If it's less than one, then what that means is the output vectors basically look random. They're all at approximately the same angle from each other, regardless of how close or far the corresponding input vectors were.

And so that makes it look kind of like a random hash of the input. And while you might be able to learn from this, generalisation is going to be impossible, because the outputs don't reflect anything about their inputs, essentially, or anything useful. And also, this condition will imply that early layers will have huge gradients compared to later layers, and that will actually make optimisation tricky.

The other case is that the output value you're squashing towards is close to one or equal to one. What that means is that all the output vectors are basically going to look identical, because of course a c value of one corresponds to a cosine similarity of one, which means the vectors are the same, assuming that they have the same norm, which in this case they do.

And so the implication of this is that gradients in earlier layers will vanish and the loss surface will become ill-conditioned, making optimisation basically impossible. Um, this can be formalised using various techniques such as NTK theory, and this is done in the paper. There are other people that have also looked at this phenomenon and tried to argue why it's bad for training, and those analyses seem to hold up as well.

All right. So the previous solution to this problem, and this comes from the papers that first observed this phenomenon, is a method called the Edge of Chaos. In that approach, the solution is to require that the derivative of the C map for each individual nonlinear layer

is equal to one when evaluated at one. It turns out this condition will slow the asymptotic convergence of c values to their asymptotic value of one as depth increases, and in particular that convergence will go from being exponential to being sub-exponential. This follows from a dynamical systems analysis of the composition of many of these functions, which cares a lot about the slope in the limit.

Unfortunately, though, given a deep enough network, c values will still be pretty close to fully converged. In other words, the network's C map is still going to be highly degenerate. As an example, the deep ReLU networks that we studied on the previous slide actually already satisfy this condition. ReLUs out of the box, with a standard initialisation and a bias initialisation of zero, satisfy this condition.

And yet we know that a deep enough ReLU network becomes untrainable, and that it has degenerate C maps, unless it has something like skip connections or batch norm, which up to this point we're not assuming, right? So I'd say the main contribution of this work is a new way of controlling C map properties. Instead of looking at the C map for individual layers, we're going to look at the C map for the whole network.

And we're going to analyse it from that perspective. So there's a way to formalise this, but the intuition can be seen from the observation that C maps are convex on the interval from zero to one.

And this is just a fact that you can prove. Intuitively, what that means is we can control the deviation of the network's overall C map from the identity function by just controlling its slope at the maximum value one, assuming that we know its value at zero. And we set that to zero, although it would also work if you set this to some particular value that's significantly less than one.

So hopefully you can see this in this picture, where if I fix the graph at this point and I vary the slope over here, there's sort of a one-to-one correspondence between that and how much the curve deviates from the identity, and in particular how much it flattens out and becomes degenerate around an output value, in this case zero.

Right, so, yeah, controlling the network's C map derivative at one gives us a way to prevent degeneration. We can pick a value, and as long as the slope isn't too extreme, the map won't be degenerate. Now, you can formalise this, and there's a result in this pretty lengthy paper that basically just says: under the condition that the value at zero is equal to zero, the deviation from the identity is bounded by a function of the derivative at one.

And also, the deviation of the derivative of the C map from the derivative of the identity function can be bounded in a similar way, and this holds over the entire input domain, not just zero to one. Right. OK, so we have a reasonable solution for C maps. Unfortunately, though, there are still other ways that the network can fail to be trainable. One of them is networks that are nearly linear.

And so, first, you can observe that linear networks have nice C maps, but their model class is very limited. In particular, linear networks actually have identity C maps, so that's sort of like the perfect, nicest possible C map; information is preserved as well as it can be. But, you know, a linear network is not going to find interesting solutions, because it's intrinsically limited.

You could say, well, let's just ban linear networks from our consideration and just stick to nonlinear networks. The problem, though, is that you can make a network nearly linear in a certain sense, so that it'll have a nice C map but be almost as hard to optimise as a linear network, which in fact is impossible to optimise, at least up to the performance that you want.

And so one example of this is you can take a ReLU network and, for each ReLU activation, just add a very large constant to its input and also subtract the same constant from its output. So essentially just transforming the ReLUs. Now they're going to basically behave like the identity function, because all reasonable inputs, in absolute value,

will be much less than this constant. So you've essentially just gotten rid of the left part of the ReLU function, the negative part of it. But, you know, you can actually prove quite easily that, with a certain choice of weights and biases, you could essentially undo the transformation that we just did and recover a standard ReLU network. So the model class hasn't changed here, but obviously the optimiser is going to struggle in this situation.

In fact, it'll probably never even evaluate the network for inputs that are in the nonlinear region of the ReLUs. So it's basically just going to be like optimising a linear network, and you're not going to get any of the benefits of using neural networks.
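A small numerical check of this example, as a sketch: the shift values and the use of standard normal samples as a stand-in for typical pre-activations are assumptions made for illustration, and `shifted_relu` is a hypothetical name.

```python
import numpy as np


def shifted_relu(x, shift):
    """ReLU with `shift` added to its input and subtracted from its output."""
    return np.maximum(x + shift, 0.0) - shift


rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)  # stand-in for unit-variance pre-activations

for shift in (0.0, 1.0, 10.0):
    # Largest deviation from the identity function over the sampled inputs.
    gap = np.max(np.abs(shifted_relu(x, shift) - x))
    print(f"shift={shift:4.1f}  max |phi(x) - x| = {gap:.3g}")
```

With a shift of 10, none of the sampled inputs reaches the kink at minus 10, so the activation is indistinguishable from the identity on this data even though the model class is unchanged.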

So to prevent this problem, which can manifest in any type of network, not just ReLU networks, we require that the derivative of the C map for each individual nonlinear layer, evaluated at one, is maximised, subject to the condition on the network's overall C map derivative. So now there's this tension: we want to make the derivatives large for individual layers, but we want the derivative for the overall network to be smaller than some constant.

And there'll be a way to handle this apparent trade-off. Yet another failure mode is that the approximations that we're basing all of this analysis on might not hold at all. And in that situation, nothing that we're even talking about makes any sense. Unfortunately, the error in these kernel approximations can actually get very high in deep neural networks unless you make the width extremely large. In the worst case, the dependence could be exponential,

requiring an exponentially wide network as a function of the depth. So that's not tenable, because, you know, networks can get quite deep these days. Now, you can see this issue perhaps most obviously when you think about q values and Q maps, and how they can be vulnerable to errors. If you look at the Q map up to first order, the error in its output is proportional to its derivative times the error in its input.

So Q maps will amplify any errors that they're fed at their input. And this derivative can get very big in a deep network, it turns out, in general. And if you think this is just a problem for q values, unfortunately that doesn't really work: if your q values are wrong, C map computations become essentially meaningless as well.

So we need to handle that problem. The solution that we use in DKS is to require that the derivative of the Q map is less than or equal to one for values of q that we expect to see. It turns out we can actually enforce this condition, or something reasonably close to it. And so this will control the compounding of errors in these kernel approximations in deep networks.

OK, so now we've identified these various failure cases and ways to manipulate the Q and C maps in order to prevent them. That leads to a set of conditions which define DKS, and these we'll discuss now. They are stated for every subnetwork f, where by subnetwork I just mean some component of the network, including the whole network, that has a well-defined input and output.

So you could think about, let's say, layers three, four and five of a seven-layer network, that is, starting at layer three and ending at layer five. And, you know, networks with more general, arbitrary structure have more interesting examples of subnetworks. And the network itself is a subnetwork too, trivially.

Right. So the first condition is more of a convention that we go with, which is that input q values of one map to output q values of one. What this says is that the network's layers preserve the norms of their inputs, at least once you account for the dimension of those vectors. In other words, it preserves the q values, which are squared norms divided by the dimension.

And it does this at least if the q value is one; for other values, you can't necessarily say anything. Although it should be noted that for ReLU networks, you get that the Q map is the identity function for free, essentially, because ReLUs kind of preserve scale from their input to their output. Right, so we go with the value of one just because that's a common convention.

You know, we could map two to two or five to five, but we just have to standardise to some vector length in order to do everything else that we want to do.

Also, by doing this, we prevent the problem where you've got exploding or vanishing vector lengths for your activation vectors in the network, which, if you go right to the end of the network, could lead to a very small or very big input to your loss function, which could lead to numerical problems or optimisation problems, depending on the type of loss. And a q value of one is kind of what most standard loss functions expect.

All right, so that's the first condition. The second condition is what we just discussed previously, which is that we require that the Q map derivative is well behaved, in order to control kernel approximation error. Now, previously we basically wanted this to be less than or equal to one for all potential values of q. In general, though, we only expect to see one q value in our network, which would be equal to one.

Of course, due to random errors, that will sort of break down. But as long as we're close, since this map is continuous and smooth, enforcing this condition at one will be good enough for values close to one as well. And we set it equal to one; we don't try to minimise it. This turns out to work best in practice. You could also have set it less than or equal to one, or tried to minimise it.

That would also control the kernel approximation error, but we find that equals one just seems to work best in practice, for reasons that are not totally well understood. This is the maximum value; if you set it larger than one, you run into problems. OK, and then we've got conditions C and D. These are the conditions that we talked about before that prevent C map degeneration.

Um, so yeah, it's setting the C map equal to zero at zero, and also restricting its derivative at one to be less than or equal to some constant. Oftentimes this is something like 1.5, or just some moderate value. And finally, we've got this condition here, which prevents the nearly-linear-networks problem. This is that the derivative of the C map at one for nonlinear layers is maximised subject to these other conditions. So for A, B and C, it turns out that

it's sufficient to have these conditions hold for the Q and C maps of nonlinear layers, and you get them for free for all subnetworks, provided that you, quote unquote, normalise any weighted sums in the network. I'm not going to describe what that means, but it's a straightforward operation. And then for D and E, the combination of those turns out to be equivalent to enforcing a condition for each nonlinear layer.

So we're setting the derivative of the C map at one to be equal to this constant to the power one over the depth of the network. For more arbitrary topologies, there is a more complicated formula; this is the one for MLPs. But it's important to note that this formula can be easily computed in the more general case, so that's not a problem.
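Written out, the conditions described above and the reduction to a per-layer slope look roughly as follows. The notation is assumed for illustration rather than copied from the slides: $\mathcal{Q}_f$ and $\mathcal{C}_f$ are the Q and C maps of a subnetwork $f$, $\mathcal{C}_\ell$ is the C map of the $\ell$-th nonlinear layer, $\zeta$ is the slope constant (around 1.5 in the talk), and $L$ is the number of nonlinear layers in an MLP.

```latex
\begin{align*}
\text{(A)}\quad & \mathcal{Q}_f(1) = 1, \\
\text{(B)}\quad & \mathcal{Q}_f'(1) = 1, \\
\text{(C)}\quad & \mathcal{C}_f(0) = 0, \\
\text{(D)}\quad & \mathcal{C}_f'(1) \le \zeta, \\
\text{(E)}\quad & \mathcal{C}_\ell'(1) \text{ is maximised for each nonlinear layer } \ell,
                  \text{ subject to (A)--(D)}.
\end{align*}
% Linear layers have identity C maps and every C map sends 1 to 1, so the
% chain rule gives the slope of the whole network at 1 as a product:
\begin{equation*}
\mathcal{C}_f'(1) = \prod_{\ell=1}^{L} \mathcal{C}_\ell'(1),
\qquad\text{so}\qquad
\mathcal{C}_\ell'(1) = \zeta^{1/L} \text{ for every } \ell
\;\Longrightarrow\; \mathcal{C}_f'(1) = \zeta .
\end{equation*}
```

This covers only the MLP case with the same slope at every layer; the talk notes that more general topologies need a more complicated formula.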

Right. So I talked a lot about conditions that we want to enforce, first on the Q and C maps for the network and subnetworks, and then translating that into conditions on individual layers. I think there's someone asking a question. Oh, sure. Yeah, yeah. Hey, that was me. On the slide before, you have these different conditions. I think slide B, sorry, point B was the thing that controls the kernel approximation error.

Did you say that this is not only for the validity of the analysis, but that it also has an impact on performance? Is that what you said? Yeah. So we need the kernel approximation error to be low in order for this analysis to make any sense. But, you know, you could achieve low kernel approximation error by requiring this to be less than or equal to one, and you could try to minimise it, and in fact if you minimise it, it will minimise the approximation error.

So why didn't we minimise it? Why do we actually set it equal to one, which is the maximum permissible value before you get runaway approximation error? And the reason is that it works best in practice, in terms of the overall effectiveness of these networks at the end of the day. We don't have a good explanation for why that's true. This is sort of one of the remaining mysteries.

OK, so A, C, D and E are, let's say, necessary, for the sake of argument, to have a performant network. But as for B, there might be networks that satisfy the other conditions, that are a good initialisation to train and so on, but that do not look anything like kernels? Um, maybe.

Of course, once the kernel approximations break down, none of these conditions really make any sense. They're, you know, describing things that are not descriptions of the network's behaviour anymore. Right. OK.

It's certainly possible that a network that is well outside of the kernel regime could, um, you know, train well, but then it's much harder to talk about it. We just don't have the theoretical tools to really analyse it at that point. OK. Thank you. Right.

So, yeah, having reduced these conditions on Q and C maps for subnetworks down to conditions on the Q and C maps for individual nonlinear layers, we still haven't actually talked about how we achieve these conditions on the Q and C maps of nonlinear layers. Like, what are our levers of control? And so for that, we're going to transform the activation functions in a fairly benign way,

I would argue. In particular, we're going to introduce non-trainable constants for both the input and the output of each activation function, so going from phi of x to this, where gamma, alpha, beta and delta are just fixed, non-trainable scalar constants. Um, because you can always carefully choose your weights and biases in your network to simulate this kind of transformation,

this will not actually change our model class, at least assuming a perfect optimiser; the space of functions computed by the network is the same. Now, in practice, of course, doing these kinds of transformations could change the inductive bias of the model under a limited optimiser like gradient descent. So a couple of examples of transformed activation functions are plotted here. This is for a vanilla 100-layer MLP.
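To make the role of the four constants concrete, here is a minimal sketch assuming the transformed activation has the form gamma * (phi(alpha * x + beta) + delta). It is not the DKS solver: it takes alpha and beta as given, picks delta and gamma in closed form so that two of the conditions hold at a q value of one (C map equal to zero at zero, and Q map sending one to one), and then reports the resulting C map slope at one. A full method would also have to choose alpha and beta to hit the slope targets. The function names and the quadrature rule are illustrative assumptions.

```python
import numpy as np

_NODES, _WEIGHTS = np.polynomial.hermite_e.hermegauss(80)
_WEIGHTS = _WEIGHTS / np.sqrt(2.0 * np.pi)


def expect(f):
    """E[f(z)] for z ~ N(0, 1), by Gauss-Hermite quadrature."""
    return float(np.sum(_WEIGHTS * f(_NODES)))


def transform(phi, alpha, beta):
    """Return phi_hat(x) = gamma * (phi(alpha * x + beta) + delta), with gamma
    and delta chosen so that E[phi_hat(z)] = 0 and E[phi_hat(z)^2] = 1."""
    def inner(x):
        return phi(alpha * x + beta)

    delta = -expect(inner)
    gamma = 1.0 / np.sqrt(expect(lambda x: (inner(x) + delta) ** 2))

    def phi_hat(x):
        return gamma * (inner(x) + delta)

    return phi_hat, gamma, delta


def c_slope_at_one(phi_hat, eps=1e-4):
    """C'(1) = E[phi_hat'(z)^2] when Q(1) = 1; derivative by central difference."""
    def dphi(x):
        return (phi_hat(x + eps) - phi_hat(x - eps)) / (2.0 * eps)

    return expect(lambda x: dphi(x) ** 2)


for alpha in (2.0, 1.0, 0.1):
    phi_hat, gamma, delta = transform(np.tanh, alpha, beta=0.0)
    print(alpha, round(gamma, 3), round(c_slope_at_one(phi_hat), 4))
```

Shrinking alpha pushes the slope towards one, which is the numerical counterpart of the activation being toned down towards a linear function.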

So in the case of softplus, we go from this sort of familiar ReLU-like softplus to a softer, more gentle curve. And we see something even more dramatic for tanh. And in general, that's what this method is going to do. It's going to take an activation function which is quite nonlinear and kind of tone it down to be closer to a linear function. I should say this is plotted over an input range that goes sort of beyond the typical range of inputs you'd expect to see.

The inputs will typically approximately follow a Gaussian distribution with a variance of one. And so once you get out to negative 10, there's essentially negligible probability that you'd ever see an input of that size. So, you know, the part of the activation function that matters is the central region. If you go further out from this graph, you'll see that this is still a tanh here.
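To put a number on "negligible probability", the following sketch computes the two-sided tail probability of a standard Gaussian. It assumes the inputs are exactly N(0, 1), which the talk only claims approximately; the function name `two_sided_tail` is hypothetical.

```python
import math

def two_sided_tail(threshold):
    """P(|Z| >= threshold) for Z ~ N(0, 1), via the complementary error function."""
    return math.erfc(threshold / math.sqrt(2.0))

for t in (1.0, 3.0, 10.0):
    print(f"P(|Z| >= {t:g}) = {two_sided_tail(t):.3e}")
```

Under that assumption, an input of magnitude 10 is far rarer than anything a training run would sample, so only the central part of the plotted curve matters in practice.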

If I were to plot it out much, much further, you'd see it eventually flatten out, but that wouldn't actually matter in practice. But it basically looks like this kind of soft curve, as it does here as well. Um. All right. So now, having described the approach, how about the experiments? So our basic setup is that we are training a ResNet-101 V2 style architecture on ImageNet, and we do this with and without batch norm and with and without skip connections.

Batch size is 512. Learning rate schedules were optimised dynamically using a method called FIRE PBT, developed at DeepMind recently, and this was done with the particular objective of maximising optimisation speed. So in other words, the choice of learning rate schedule is not a confound with regards to optimisation speed. Now, why did we examine optimisation speed rather than generalisation performance?

Well, the goal of this work was mostly just to make up the gap between ResNets and networks that don't have all of those architectural flourishes, and the main gap that you see there is actually optimisation speed. In fact, the networks without skip connections and batch norm, if they're made deep enough, basically don't even train at all a lot of the time. They'll just sit at zero performance. We do have some follow-up work that looks more at generalisation performance.

That was actually recently accepted at ICLR; I'm not going to talk about that here. So, yeah, and the other optimisation hyperparameters were tuned lightly, not as much as the learning rate, but even at Alphabet we only have limited capabilities, although you would cry if you saw how much computation I used. And yeah, there's lots of experiments in the paper, tons of observations and different things that we studied in relation to this approach.

In fact, that makes up sort of the bulk of the paper. It's just experiments. So the main result is for vanilla networks, that is, networks without skip connections or batch norm, using DKS, and comparing those to ResNets. And so a standard ResNet is this blue curve here, and you can see that the DKS networks with softplus or tanh are only slightly slower.

And meanwhile, the ResNet versions where we've stripped out either the skip connections, the batch norm, or both perform worse, so you do need all of those elements. You can't just rip them out. This is with K-FAC, I should say, and that's a very important point to make. K-FAC is an optimiser for neural nets. It's a non-diagonal approach. It's pretty powerful, but somewhat expensive.

If we go to SGD optimisation, the situation isn't as nice. The standard ResNet optimises at about the same rate as it did with K-FAC, maybe a little slower, but the networks with DKS are now trailing behind, although they're still doing much better than the networks that don't use DKS. But there is now a gap with ResNets. So let's try to drill down into this a little bit more.

If we look at using skip connections with DKS: if you're using K-FAC, then once you introduce DKS and K-FAC, skip connections don't seem to matter at all. Here we're using skip connections that have been chosen to have a residual weight equal to this constant, and the weight on the shortcut is such that the sum of the squares is equal to one. That's a requirement of this method. And it turns out that actually, you know, well, yeah.
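The weighting constraint on the skip connections can be written down directly. In this sketch the shortcut weight is derived from the residual weight so that the two squares sum to one. The helper names and the value 0.6 are hypothetical, and the actual constant used in the experiments is not stated here.

```python
import math

def shortcut_weight(residual_weight):
    """Shortcut weight s chosen so that s**2 + residual_weight**2 == 1."""
    if not 0.0 <= residual_weight <= 1.0:
        raise ValueError("residual_weight must be in [0, 1]")
    return math.sqrt(1.0 - residual_weight ** 2)

def weighted_residual_block(x, residual_branch, residual_weight):
    """Return s * x + residual_weight * residual_branch(x), elementwise."""
    s = shortcut_weight(residual_weight)
    fx = residual_branch(x)
    return [s * xi + residual_weight * fi for xi, fi in zip(x, fx)]

# Hypothetical example: a tanh "branch" with residual weight 0.6.
x = [0.5, -1.0, 2.0]
y = weighted_residual_block(x, lambda v: [math.tanh(t) for t in v], 0.6)
print(y)
```

A residual weight of zero reduces the block to the identity, and a residual weight of one removes the shortcut entirely.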

So all of these approaches match the performance of a sort of standard ResNet. If we go to SGD, that's where things get a bit interesting. And now we see that, in fact, we can obtain the same performance as a standard ResNet using DKS just by reintroducing skip connections into these networks, at least with the softplus activation. Tanh for some reason behaves weirdly in this experiment, although I should note that almost all other activation functions do well.

So it does seem that, you know, with skip connections, DKS plus SGD is good enough to match ResNet performance. But your other option, if you don't want to use skip connections, is to use K-FAC. Now we can apply DKS to networks with a whole bunch of different activation functions. And actually, we see that many of these activation functions, including a lot of ones that typically work very badly or don't even train at all in such deep networks, work just fine.

In fact, they all work pretty similarly, except for ReLU; you know, ReLU is actually trailing behind here. Somewhat ironically, actually, because it's not really compatible with DKS: the kinds of transformations that we do on the activation functions have limited power in the case of ReLU activation functions. So in some ways this isn't really the performance of DKS at all, because the method isn't really working properly in the case of ReLUs.

We can also look at the effect of using different optimisers, because we've seen this strong dependence on the optimiser, at least when we're talking about networks that don't have skip connections. And we see that in fact K-FAC and Shampoo, or a modified version of Shampoo, are both doing very well and match the performance of ResNets, whereas SGD and Adam perform roughly similarly and do not allow us to replicate the optimisation performance of ResNets.

We can also look at some previous work, for example the Edge of Chaos method, which is sort of the main comparison for DKS. And yeah, we see that DKS is clearly performing better in the case of these tanh networks. This is with K-FAC, although the gap persists and gets even bigger with, if my computer would load the graphs there... And in fact, yeah, in fact, [unclear].

And in this context, all of these, by the way, are below ResNets, standard ResNets with SGD. We can also look at Looks Linear, which is a method that tries to initialise the network to be exactly linear at initialisation time, due to certain weight symmetries and the use of ReLU activation functions. And it turns out that it doesn't seem to work well with K-FAC, I think because K-FAC too aggressively breaks the symmetries that you have, and as a result the network enters this very nonlinear behaviour too quickly and things sort of go off the rails.

So we have to use Looks Linear with Adam, and in that case there's a clear performance gap. And this is partly just because we're using K-FAC with DKS. With SGD, Looks Linear is much closer to DKS, I would say.

Right. So, yeah, this is coming to the end of the talk. Current limitations of this approach: we do not have support for multiplicative units like you see in Transformers. But I think an extension is very possible and quite interesting, actually. Vanilla networks, that is, networks without skip connections or batch norm, using DKS do seem to generalise worse, at least in these experiments. I didn't talk about generalisation performance, but that is an observation we make.

Although that's largely been addressed in follow-up work, that was this ICLR paper. And in particular, if you just change the way you do the optimisation and also make some small changes to DKS, you can actually close the gap to standard ResNets almost completely. To match the optimisation speed of ResNets using vanilla networks, we had to use K-FAC.

Otherwise, we had to reintroduce the skip connections. And [unclear] on these vanilla networks will require at least [unclear], and perhaps more. Trying to understand that, I think, is a very interesting question for future work as well. Maybe with the right tweak to this approach, we could actually have it perform just as well as ResNets when using SGD.

Right. So I think the outlook is pretty good for DKS, and it could be a useful tool for unlocking new model classes. I think that's the primary application here: allowing you to design your models without having to sort of rely on some complex set of tricks to make them optimise faster for reasons that you don't fully understand. And it also should help enable existing models that have optimisation issues to train better.

And we've sort of started to look at that, actually, at DeepMind. And also, you know, if you do have models where tricks like batch norm or skip connections are causing problems or can't be used, this method could be very useful in those contexts. Right. So there's a paper on arXiv. It's long, but I'd say it's actually not very dense and it's very self-contained.

So hopefully, if you're interested, you'll find it an easy read, and also a lot of the length is just in terms of the experiments. And there's an official implementation which is going to be on GitHub quite soon. And here are some of the works that inspired this project, and I'm happy to take any questions.

Thanks a lot, James, for a wonderful talk. So there's one question in the chat window. It's quite long, so I'll just read it.

You showed, remarkably, that DKS plus any activation function seems to perform similarly, irrespective of the choice of activation function. But is it surprising at all, since you have shown that a tanh or a softplus, with DKS, look the same in terms of the resulting activation function? Is it then not surprising? Doesn't this kind of phenomenon tell us that a much smoother activation function would actually work better than anything else?

Yeah. Well, OK. So there's a few things there, I would say. It still is somewhat surprising, because it's not obvious that when you train these networks, they're going to stay in the regime described by the theory we developed, these, you know, Q maps and C maps, right?

The inputs to the activation functions could get much bigger or much smaller than what is predicted by this theory during the course of training. Now, that won't happen in the NTK regime, but not everybody believes that networks behave like kernel machines when you're training them.

So I think it's still not obvious that this would work, despite the fact that, yes, as you pointed out, the activation functions do look similar to each other once they're transformed, at least in the region where the kernel theory says that, you know, their behaviour should matter. Another thing is that it's not good enough just to make activation functions smooth.

I mean, this approach is making them smooth in a very, very particular and delicate way, and it's very easy to trick yourself into thinking that you can just eyeball these plots for the activation functions and know if they're going to do well. Trust me, that's not true. For example, a softplus looks very similar to a ReLU when you just look at the graph, but in terms of the kernel properties they are wildly different. Is there any other question from the audience?
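The remark that softplus and ReLU look alike on a plot but differ in their kernel properties can be probed numerically. The sketch below is only illustrative. It assumes the C map of a single nonlinear layer is E[phi(u) phi(v)] / E[phi(u)^2] for unit-variance Gaussian inputs u and v with correlation c, a normalisation chosen here for convenience and not necessarily the exact one used in the talk, and it uses a plain trapezoid rule. The names `gaussian_nodes` and `c_map` are hypothetical.

```python
import math

def gaussian_nodes(n=321, limit=8.0):
    """Trapezoid-rule nodes and weights for expectations under N(0, 1)."""
    h = 2.0 * limit / (n - 1)
    nodes = []
    for i in range(n):
        z = -limit + i * h
        w = h * (0.5 if i in (0, n - 1) else 1.0)
        nodes.append((z, w * math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)))
    return nodes

NODES = gaussian_nodes()

def c_map(phi, c):
    """Estimate E[phi(u) phi(v)] / E[phi(u)^2] for correlation c in [-1, 1]."""
    s = math.sqrt(max(0.0, 1.0 - c * c))
    num = 0.0
    for z1, p1 in NODES:
        # v = c * z1 + s * z2 has unit variance and correlation c with u = z1.
        inner = sum(p2 * phi(c * z1 + s * z2) for z2, p2 in NODES)
        num += p1 * phi(z1) * inner
    q = sum(p * phi(z) ** 2 for z, p in NODES)
    return num / q

def relu(x):
    return max(0.0, x)

def softplus(x):
    return math.log1p(math.exp(x))

for name, phi in (("relu", relu), ("softplus", softplus), ("tanh", math.tanh)):
    print(name, [round(c_map(phi, c), 4) for c in (0.0, 0.5, 1.0)])
```

Comparing the printed rows for ReLU and softplus gives a more reliable picture than comparing their graphs. Under this definition an odd function such as tanh maps c = 0 to zero, while ReLU maps it to 1/pi.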

You can just unmute yourself.

Yeah, there's one question. It's very interesting. I was wondering, and it's more just to understand the work better: when you are enforcing the conditions, the four or five conditions that you found, like the C map being zero at zero, do you work primarily on the activation functions, or do you also work on the initialisation of the weights?

I was thinking that the C map, as far as I remember, was defined by the variances of the weights in the original paper. So, I don't know.

So we assume that the variances are fixed for the weights, and all of our manipulations happen on the activation functions.

So it turns out that, due to that statement that I made about these transformations being something that you can replicate by manipulation of the weights and biases, you could actually transform this approach into one that only sets the weights and the biases, although I should point out that it will require the weights and biases to be non-independent, which sort of departs from the old literature, which always assumes that they're independent.

So you can do that. It's cleaner, I would say, to think about it in terms of activation transforms and just leave the weight distributions pretty vanilla. It also works better in practice. That's another finding. So because K-FAC is invariant to that kind of reparameterisation, if you push the transformation out of the activation function and into the weights and biases, K-FAC is basically insensitive to that.
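The statement that the activation transformation can be replicated by manipulating weights and biases can be checked on a tiny example. This is a sketch under the assumption that the transformed activation is gamma * (phi(alpha * x + beta) + delta) and that it sits between two dense layers; the function names and all numeric values are hypothetical.

```python
import math

def dense(W, b, x):
    """Compute W @ x + b with plain lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def net_transformed(x, W1, b1, W2, b2, phi, alpha, beta, gamma, delta):
    """Two dense layers with the transformed activation in between."""
    h = dense(W1, b1, x)
    a = [gamma * (phi(alpha * hi + beta) + delta) for hi in h]
    return dense(W2, b2, a)

def net_plain(x, W1, b1, W2, b2, phi):
    """Two dense layers with the untransformed activation in between."""
    return dense(W2, b2, [phi(hi) for hi in dense(W1, b1, x)])

def fold_into_params(W1, b1, W2, b2, alpha, beta, gamma, delta):
    """Absorb alpha, beta into layer 1 and gamma, delta into layer 2."""
    W1f = [[alpha * w for w in row] for row in W1]
    b1f = [alpha * b + beta for b in b1]
    W2f = [[gamma * w for w in row] for row in W2]
    # The new output bias depends on the rows of W2.
    b2f = [b + gamma * delta * sum(row) for row, b in zip(W2, b2)]
    return W1f, b1f, W2f, b2f

W1, b1 = [[0.5, -1.0], [2.0, 0.3]], [0.1, -0.2]
W2, b2 = [[1.0, -0.5]], [0.05]
consts = dict(alpha=0.7, beta=0.2, gamma=1.3, delta=-0.4)
x = [0.3, -1.2]

out_a = net_transformed(x, W1, b1, W2, b2, math.tanh, **consts)
out_b = net_plain(x, *fold_into_params(W1, b1, W2, b2, **consts), math.tanh)
print(out_a, out_b)
```

The two outputs agree up to rounding. The folded output bias is computed from the output weights, which is one concrete way in which the weights and biases stop being independent.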

So in fact, the performance will be quite similar. But for SGD, pushing the transformation out of the activation function and into the weights and biases actually makes it much worse. And that's because, again, it's not invariant to that kind of thing. And why is the parameterisation that does the manipulation inside the activation function different from the one that does it outside, in terms of SGD optimisation performance?

I don't know. I mean, you could probably make an argument in terms of, like, the condition number of the NTK or something. But it's not something that we've studied in depth, and, you know, that difference might be the key to sort of making this approach work even better than it does with SGD.

Because if you could make the resulting optimisation landscape even better conditioned without skip connections, that would perhaps enable us to get rid of K-FAC from this equation.

OK, thank you very much. I think we are out of time, so let's thank the speaker again. Thank you.

Transcript source: Provided by creator in RSS feed: download file