I'll do my best to do it remotely. OK, so I work in interpretable machine learning, and over the years I've seen the field of machine learning lean more and more toward more complicated models, even in cases where they're completely unnecessary. And so I didn't just want to give a talk that says stop using black box models, because that's really not constructive, right? That's just destructive. It's just telling people what not to do rather than what they could do.
And so I wanted to tell you, not just don't use black box models, but I wanted to tell you why you don't need them. Well, now let me give you an example of recidivism prediction in the U.S. criminal justice system. Now here they use predictive models to determine people's risk of being arrested in order to determine whether to release someone on bail or parole or for social services. And in some cases, the models are so complicated that it's easy to compute the predictions incorrectly.
So this is an article from The New York Times about a case where a typographical error in the input to a black box predictive model led to years of extra prison time for someone. The typo was in his criminal history features, and he was denied parole. Later, he compared his score sheet to someone else's, and he found that there was a typo in the criminal history features. Anyway, the model that the justice system was using for his prediction was called COMPAS.
You may have heard of it. It's a very famous black box model used in the justice system, and there have been a lot of arguments about it. You'd think that models like COMPAS would be more accurate, because they use over 100 features and they're created by a company whose job it is to do that. And so, yeah, you'd think these models would be more accurate, but they're not. So we did an experiment on data from Florida to test the accuracy of this particular black box model.
And by the way, COMPAS is pretty widely used across the justice system. And yeah, we compared COMPAS to our latest machine learning method in the lab at the time of this experiment, which is called CORELS. CORELS is an optimal decision tree method. It produces really sparse, one-sided decision trees, and it came up with this machine learning model that fits in the bottom of a PowerPoint slide.
And the model says: if the person is young and they're male, predict arrest within two years of the COMPAS score calculation. Also, if they're a little older and they have two or three prior offences, predict arrest within two years of the COMPAS score calculation. Or if they have more than three priors, predict arrest. Otherwise, predict no arrest. And we looked at this model and thought, OK, that's pretty simple. But how is it possibly going to be as accurate as COMPAS?
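To make the shape of that model concrete, here is a minimal sketch of the rule list as a Python function. The talk only says "young" and "a little older"; the age cutoffs below (18 to 20 and 21 to 23) are assumptions for illustration, and the function name is hypothetical.

```python
def predict_rearrest(age: int, sex: str, priors: int) -> bool:
    """One-sided rule list in the shape described in the talk.

    The age ranges are illustrative assumptions; the talk does not state them.
    Returns True for "predict arrest within two years", False otherwise.
    """
    if 18 <= age <= 20 and sex == "male":    # "young and male" (assumed range)
        return True
    if 21 <= age <= 23 and 2 <= priors <= 3:  # "a little older", 2 or 3 priors
        return True
    if priors > 3:                            # more than three priors
        return True
    return False                              # otherwise predict no arrest


# Example calls on made-up inputs.
print(predict_rearrest(age=19, sex="male", priors=0))
print(predict_rearrest(age=40, sex="female", priors=1))
```

The whole model is three conditions checked in order, which is why a typo in one input would be easy to spot by hand.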
And it was. And so what I'm showing you here is that the models are about equally accurate. This is 10 folds of data, so each colour is a different fold here. And yeah, the performance is very similar. But not only did these two models perform the same; as it turns out, no matter which machine learning method we tried, they all performed the same.
And some of these are complete black boxes, like COMPAS, which is proprietary, and some of them are black boxes just because they're huge formulas that you can't fit on a slide, like support vector machines with radial basis function kernels, decision trees, random forests and so on. So, you know, there was this huge debate about the algorithmic fairness of COMPAS. But the truth is that we just don't seem to need COMPAS at all. So why are we still using it, right?
Now, back to my point here. There doesn't seem to be any benefit from complicated models for re-arrest prediction in criminal justice. There's a lot of literature on exactly that problem. There's just no reason to use a black box model for criminal recidivism prediction, as far as I can tell. But it's true that there's no benefit from complicated models for lots of different problems. And I'm listing here a whole bunch of problems that I've worked on in my career.
And for none of these problems have we seemed to need a black box model. Now, it really depends on your data representation, though. If you're working in computer vision, neural networks are really great for computer vision. They're great when you need to create a good representation for your data; that's where you want to use a neural network.
But if your data naturally come with a good data representation, like in all the problems I've listed here, then all the algorithms tend to perform very, very similarly, as long as you're willing to do some preprocessing on the data. So why, then, are we still using complicated models? There are some really good reasons. First of all, we like them. They're profitable. The COMPAS people are making a profit off the US justice system.
It's very much easier to sell something like COMPAS than to sell something like the CORELS model I had on the previous slide. Also, it's much easier to construct a black box model than a simple model. To construct a black box, you take your data, throw it into an algorithm, and you get a model, whereas constructing a simpler model is much more difficult. You actually have to optimise for the simplicity of the model.
OK, so yeah, it's kind of ironic, right, that complicated models are much easier to construct and that simple models can be really, really hard to find. OK, so let's say that we're doing supervised learning, where we want to minimise the loss function to make our models accurate. But now we also want our models to be simple, and we have to constrain them.
We have to force them to be simple. And once you get to constrained optimisation, depending on the constraints, the problem becomes much harder. Now, if the problem on the left is about finding an accurate decision tree, that is much easier than finding an optimal sparse tree with the same level of accuracy. It's exponentially harder. Oh, and here are some examples: this is CART, a greedy algorithm, run on a data set.
And you know, that's very easy. If you want to find an optimal sparse tree with the same or better accuracy, you have to run a specialised algorithm that solves a much harder computational problem. And this algorithm on the right here, GOSDT, is an algorithm that we designed that was published in 2020, and it's getting accuracies that are better than CART's with much sparser models, but it does a huge amount of work to get to that, whereas CART is, like, from the nineteen nineties or nineteen eighties, right?
OK, now if that problem on the left is about finding an accurate linear model, well, that's easy. You know, you can do that with regression or logistic regression. But then once you add sparsity constraints, we all know that the problem becomes much harder.
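The contrast between the "left" and "right" problems can be written down as a pair of optimisation problems. This is a sketch in notation that is not on the slides as transcribed: the function class \mathcal{F}, the empirical loss \hat{L}, the complexity measure size(f) (for example number of leaves or number of nonzero coefficients) and the budget C are all assumed symbols.

```latex
% Left: fit any model in the class (the easy problem)
\min_{f \in \mathcal{F}} \; \hat{L}(f)

% Right: the same loss, but only sparse models are allowed (the hard problem)
\min_{f \in \mathcal{F}} \; \hat{L}(f)
\quad \text{subject to} \quad \mathrm{size}(f) \le C
```

The question posed next in the talk is whether the optimal values of these two problems are about the same.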
So what about if you just unleash the most complexity you have on the problem? We all know how easy it is to construct an accurate neural network or an accurate boosted decision tree. Now my question is, can you get the same accuracy with maybe an accurate and sparse decision tree, or maybe an accurate and sparse linear model, right? Am I going to get the same accuracy when I solve this problem as opposed to this problem?
So do I need to sacrifice accuracy in order to gain interpretability? So can I determine whether this equality is true without actually solving this? And what would it take to do that right? What would it take to check whether or not I could get an accurate and simple model? Same level of accuracy as my black box. OK, so in other words, can I determine the existence of a simple yet accurate model without actually finding one?
So that's what I'm trying to do during this talk. OK. So in this talk, I'm going to define a condition under which a simple yet accurate model is likely to exist, and that condition is that the Rashomon set is large. I'm going to tell you what that means in a minute. OK, so the true Rashomon set is the set of models with low true loss.
OK, so this is your expected loss. This is if you know the whole distribution that the data comes from. And then this is abstract function space, just my hypothesis space. And then this is the Rashomon set. It's the set of models that have expected loss less than some value theta. OK, now I claim that if the true Rashomon set is large, so in other words, if there are a lot of good models, then a simple yet accurate model is likely to exist. OK, so this is the idea.
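In symbols, the definition just given can be sketched as follows. The names R_set, \mathcal{F}, \ell and \mathcal{D} are assumed notation for the Rashomon set, the hypothesis space, the pointwise loss and the data distribution; only theta is named in the talk.

```latex
L(f) = \mathbb{E}_{(x, y) \sim \mathcal{D}} \big[ \ell(f(x), y) \big],
\qquad
R_{\mathrm{set}}(\mathcal{F}, \theta) = \{\, f \in \mathcal{F} : L(f) \le \theta \,\}
```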
The idea is that if there are a lot of good models, then hopefully at least one of them is simple. So this is kind of like a big fish theory, right? Like a big ocean theory. If you have a really big ocean, then you're more likely to find a big fish swimming in there somewhere. OK. So in a sea of equally accurate models, maybe there's a good one in there somewhere. Maybe there's at least one simple one. OK, now I'm just going to change my notation very slightly.
Instead of expected loss with all that notation, I'm just going to write L. OK, so L is expected loss. And oh, by the way, I drew a really nice, smooth loss function there with one minimum, but it doesn't have to be that way. In fact, the space could look like this and the Rashomon set could be disjoint. And in fact, it's possible for the whole space to be discrete, so that the Rashomon set is just a bunch of points in the space.
OK, so what I want to do now is. Create the simplest possible abstract setting to show you how this thing at the bottom could possibly happen. And I want to make it precise as to how this happens. I don't want to just wave my hands and say it happens. I want to actually just show you an abstract setting where it does happen, OK? So. I'm going to take two finite hypothesis spaces, so two finite function spaces F1, which is the set of simple models and F2, which is all models.
And I will say that F1 lives in F2, right? Simple models live in all models. And in fact, I'm going to say that F1 is uniformly drawn from F2 without replacement. So I know this is abstract, and I know that simple models are not drawn randomly from a more complex model class. But in reality, as long as each complex model is reasonably close to a simple model, then the same idea that I'm going to show you is going to work out just fine.
Now, let's say that f2 star is the best model if I knew everything. So it's the best model to choose from where I get to use all models and I know the whole distribution that the data comes from. OK, so this is if I know everything: I get to use the more complex model class, I know the whole loss function, everything. Whereas this guy, f1 hat, that's what I can get on my data. So this is the empirical risk minimiser from the simple function class.
OK, so that's what I would love to get if I knew everything, and this is what I can get with my data. Now, what I want to know is whether these two guys achieve the same level of accuracy. OK. So again, using my simpler notation, this true risk becomes L and this empirical risk becomes L hat. And I want to know whether the best true risk of the complex class, L of f2 star, is close to the best empirical risk of the simpler class.
So that's L hat of f1 hat. OK. So I want to know whether these two things are close. In other words, I want to know whether what I compute on my data is close to the best possible thing I could get if I knew everything. And the bound is going to involve the Rashomon ratio. Now, the Rashomon ratio is the fraction of models that are good. OK, so it's the number of models with low true loss divided by the total number of models.
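The objects just introduced can be collected in one place. This is a sketch in assumed notation (the subscripts, the hats and the name R_ratio are choices made here); it states the question being asked and the definition of the ratio, not the bound itself, whose exact form is not given in the talk.

```latex
% Best model overall, with full knowledge of the distribution
f_2^{*} \in \arg\min_{f \in F_2} L(f)

% What can actually be computed: the empirical risk minimiser over simple models
\hat{f}_1 \in \arg\min_{f \in F_1} \hat{L}(f)

% The question: are these two quantities within epsilon of each other?
\big| \, L(f_2^{*}) - \hat{L}(\hat{f}_1) \, \big| \le \epsilon \; ?

% Rashomon ratio: fraction of all models that are good
R_{\mathrm{ratio}}(F_2, \theta) = \frac{\big| \{ f \in F_2 : L(f) \le \theta \} \big|}{|F_2|}
```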
OK, so this is the fraction of models that are good, and that's going to appear in my bound. OK, so I put all the notation from the previous slide up at the top here. And the bound goes like this. It says that for any epsilon greater than zero, with high probability (and I haven't told you what this probability is yet, but I will) with respect to all the randomness, the empirical risk of the function that I get from my data is close to the best possible thing I could get.
And the probability with which this holds depends on the Rashomon ratio. So if the Rashomon ratio is larger, so if I have more good models, then this bound is more likely to hold, and what I can compute on my data is going to be close to the best possible thing I could get if I knew everything.
Another nice thing about this bound is that it only depends on the size of F1, the smaller function class. It doesn't depend on the size of the larger function class. OK, so cool. So this bound is saying, you know, as long as the Rashomon ratio is big enough, then we're good. OK? Now, I know this probability here is a little inscrutable, so I'm going to give you some examples of its calculation.
So let's say that we had 100,000 functions. Now, if at least one percent of them are good, then the bound holds with 99 percent probability when you have at least five hundred twenty-six simple functions. So here's another example. Again, 100,000 models. If at least half a percent of them are good, then the bound holds with 99 percent probability when just over a thousand of them are simple.
So, Cynthia, can I ask you something? Absolutely.
So in your definition of the Rashomon ratio, you have a theta parameter and then you have a gamma parameter. So are they related, or...? Yeah, sorry, they're the same. So theta is the same as gamma. OK. Yeah, sorry about that. Yeah, it's fine. And what happens if F1 is the same size as F2? Then I think the situation is degenerate in a sense, and it boils down to our understanding of empirical risk minimisation.
Well, if F1 is the same as F2, then all of the models are simple and the bound trivially holds. So you're right. But yeah, this is using regular learning theory. This side is just regular learning theory, so with just this part of it you'll get back to regular learning theory. We're leveraging learning theory here. Yeah, yeah. Sorry about the notation issue. There should have been a gamma. I changed recently from theta to gamma.
Sorry, do you mind if I just ask, what do you mean when you say 100,000 models? Obviously, you have continuous coefficients that... No, no, no, I'm in an abstract setting where everything is discrete, so I'm in a setting where there are finite hypothesis spaces. You want to think about this as if your data live on a giant hypercube or in a giant categorical space where, you know, even if you have continuous functions, the realisations of them are discrete.
And so you should just think of this as an abstract setting where everything is discrete. OK. All right. Thank you. Yeah. And the thing I was going to say after this is that actually, in general, the idea generalises to the case of everything being continuous and smooth.
But you have to replace some of the assumptions. In particular, this random draw assumption, which is unrealistic, you would replace with a smoothness assumption over the class of models, and you have to assume that F1 is a good cover for F2. And then in that case, the whole idea generalises to more realistic settings. Mm-hmm. Yes. Actually, Konstantinos Gutsiness is asking, What does it mean that we uniformly draw F2?
Then we need F2 to be specific, but it seems that you are addressing this point now, right? Yeah, yeah. I was just about to. You guys are like one step ahead of me, and that's totally great, because it means that people are following this lecture, which makes me happy, and which is really hard to do remotely. OK, cool. All right. Great.
So, yeah, I gave some examples of this. And essentially what I was trying to say here is that if the Rashomon ratio is sufficiently large, so if you have a large enough set of good models, then with high probability, the best empirical risk over the simpler class is close to the best possible true risk of the larger class, and the generalisation guarantee comes from F1.
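One ingredient of the examples given earlier can be explored numerically: the chance that a simple class drawn uniformly without replacement from all models contains at least one good model. The sketch below computes only that draw probability. It is not the full bound from the talk, which also carries a learning-theory term for F1, so its numbers need not match the slide exactly; the function and argument names are hypothetical.

```python
def prob_simple_class_hits_rashomon_set(n_all: int, n_good: int, n_simple: int) -> float:
    """P(at least one of n_simple models, drawn uniformly without replacement
    from n_all models, is one of the n_good models).

    This is 1 minus the hypergeometric probability of drawing zero good models.
    """
    if not (0 <= n_good <= n_all and 0 <= n_simple <= n_all):
        raise ValueError("need 0 <= n_good <= n_all and 0 <= n_simple <= n_all")
    if n_simple > n_all - n_good:
        return 1.0  # more draws than bad models, so a good one must be drawn
    miss = 1.0
    for i in range(n_simple):
        # Probability that draw i is also bad, given all earlier draws were bad.
        miss *= (n_all - n_good - i) / (n_all - i)
    return 1.0 - miss


# 100,000 models, 1% of them good (Rashomon ratio 0.01), 526 simple models.
p = prob_simple_class_hits_rashomon_set(100_000, 1_000, 526)
print(f"chance that F1 contains a good model: {p:.4f}")
```

A larger Rashomon ratio (more good models for the same total) or a larger simple class both push this probability toward one, which is the qualitative behaviour the bound relies on.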
So this is basically the simplest possible abstract setting where a large Rashomon ratio actually gives you a better guarantee on the quality of your performance. Yeah, the quality of your performance on data, right, as opposed to knowing everything. And as I mentioned, in the paper this is only the first theorem, and there's a series of theorems that I won't have time to get into today.
But essentially, we replace the random draw assumption with smoothness assumptions, so that everything is nice and smooth and F1 is a good approximating set for F2. The other assumption that you can make is that the Rashomon set contains a really big ball, so that as long as F1 approximates F2 nicely, there's at least one function from F1 in that big Rashomon set ball. And then you don't need the random draw assumption anymore, and it's much more realistic.
OK, so the results I just showed you, and the theorems in the continuous setting that I don't have time to talk about, suggest that as long as F1 is a good approximating set for F2 and the Rashomon set is large, then we might as well work with the simpler class, because we're not getting any benefit from using the more complex class. You're going to get the same level of accuracy from the simpler class as from the complex class.
So in other words, if decision trees, which are piecewise constant functions, approximate neural networks, which are smooth functions, and if for my problem the Rashomon set is large, then I can just work with decision trees. I'm not going to get any benefit from working with neural networks, because decision trees approximate neural networks.
OK, now I want to point out that we're not doing standard learning theory. We're using standard learning theory, but this is not the same thing. Large Rashomon ratios pertain to the existence of good models, of models with good generalisation and good performance. That's different from regular learning theory, right? Regular learning theory compares empirical risk to true risk for the same function or for a class of functions.
Here we're talking about existence of models from a different class. So the Rashomon ratio is not the same thing as the geometric margin that's used in support vector machines and other forms of learning theory, because the margin is measured with respect to one model, whereas the Rashomon ratio is a function of many models. It's not the same thing as the VC dimension. The VC dimension is data independent; it's a property of a function class only.
Whereas the Rashomon ratio is a property of a specific data set. The Rashomon ratio is large for a specific data set and function class. It's not the same thing as algorithmic stability, which talks about the way you search through the space to find a model. Stability depends on making changes to a data set. Here the data set is fixed. It's not the same thing as Rademacher complexity.
Rademacher complexity measures the function class's ability to fit noisy targets, whereas the Rashomon ratio uses fixed labels. It's not the same thing as a flat minimum, which has become popular in neural networks. Here we don't necessarily even have to have a continuous function space, and the Rashomon set could include many local minima. OK. So, all right, that's what the theory says. The theory says that large Rashomon sets allow us to use simpler functions without losing accuracy.
Now, what happens in practice? What actually happens in reality? Well, usually you can't figure that out, because measuring the Rashomon ratio is not something that you could normally do. It would require you to look at the whole model class, which is not practical. But today we're going to do it anyway, just to find out what happens. OK, so don't do this at home, but we're going to do it today, and we're going to use the empirical Rashomon ratio because we have data.
So we'll use this quantity, which is the fraction of models that are good. OK, so this is just the number of functions with low empirical loss divided by the total number of functions. OK. All right, and again, you'd never calculate that in reality, but we're going to do it. And in particular, I want to do an experiment. I'm going to compare two things. The first one is the size of the Rashomon set, and the second is the performance of lots of different machine learning models.
OK, so in the first part of the experiment, I'm going to calculate the size of the Rashomon set, and I'm going to estimate it. And the way I'll do it is using decision trees of depth seven. OK, why decision trees of depth seven? Well, it's because decision trees can be sampled. I can sample decision trees. OK. And also decision trees are piecewise constant functions.
They're a good approximating set for a much larger function space, because decision trees of depth seven are pretty powerful. They can actually fit a lot of data sets really well. OK, so that's why we're going to calculate the Rashomon set, the empirical Rashomon set, that way. OK. So let's say that we have that. OK, what's the other thing we're going to check?
And oh wait, I forgot to mention. Yeah, I just want to explain this a little more. So let's say we have a function space that we're interested in. That function space includes decision trees, support vector machines and so on. It's just a big function class that we're interested in. OK, so here's the Rashomon set. What we're doing, essentially, is that we're going to approximate that whole function class with decision trees.
And I claim that decision trees of depth seven are a good cover for this space, because they're just piecewise constant functions and so they approximate smooth functions. And also, you know, they approximate random forests and boosted decision trees too, which are essentially combinations of trees. So anyway, I'm going to just compute the fraction of decision trees of depth seven that are in the Rashomon set. And that's how I'm going to get this estimate of the size of the Rashomon set.
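The estimate described above (the fraction of sampled trees that land in the empirical Rashomon set) can be sketched in a few lines. This is a toy under stated assumptions: features are binary, trees are full trees sampled uniformly over one particular encoding, and plain Monte Carlo is used rather than the importance sampling mentioned later in the talk. Plain sampling would see no hits at all when the ratio is tiny. All names are hypothetical.

```python
import random


def sample_tree(depth, n_features, rng):
    """Sample a full binary tree: internal nodes are (feature, left, right),
    leaves are class labels 0 or 1. Every choice is uniform and independent."""
    if depth == 0:
        return rng.randint(0, 1)
    feature = rng.randrange(n_features)
    left = sample_tree(depth - 1, n_features, rng)
    right = sample_tree(depth - 1, n_features, rng)
    return (feature, left, right)


def predict(tree, x):
    """Walk down the tree: go right when the tested binary feature is 1."""
    while not isinstance(tree, int):
        feature, left, right = tree
        tree = right if x[feature] else left
    return tree


def empirical_loss(tree, X, y):
    """Misclassification rate of the tree on the data set."""
    return sum(predict(tree, x) != label for x, label in zip(X, y)) / len(y)


def estimate_rashomon_ratio(X, y, depth, theta, n_samples=2000, seed=0):
    """Fraction of sampled trees whose empirical loss is at most theta."""
    rng = random.Random(seed)
    n_features = len(X[0])
    hits = 0
    for _ in range(n_samples):
        tree = sample_tree(depth, n_features, rng)
        if empirical_loss(tree, X, y) <= theta:
            hits += 1
    return hits / n_samples


# Tiny made-up data set in which the label equals the first feature.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 0, 1, 1]
print(estimate_rashomon_ratio(X, y, depth=2, theta=0.25))
```

Because the same seed reproduces the same sampled trees, raising theta can only keep or increase the estimate, which is a quick sanity check on the code.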
OK, so then the second part of the... oh yeah, and these are all my trees; my little green dots are the trees. In the second part of the experiment, I'm going to just run a whole bunch of different machine learning methods on the data set, and I want to know whether they perform similarly, because if they do, it means that they all live in a big Rashomon set.
OK, so if this is true, that means the Rashomon set can accommodate functions of lots of different types, right, because it could accommodate a support vector machine and a boosted decision tree. So I want to know if all these methods perform similarly, and I want to know if they generalise, and I want to know how that correlates with the size of the Rashomon set. OK. All right, so that's the experiment. Let me show you the results. OK, so the results are.
That when the Rashomon ratio is measured to be large, then all the methods tend to perform similarly, and they generalise. That's the result I'm going to show you in the next couple of slides. And interestingly, the converse isn't always true. That surprised us, but we think it's an artefact of the way that we're measuring the size of the Rashomon set. If features are correlated with each other, it's really easy to underestimate the size of the Rashomon set, and so sometimes our measurement of the Rashomon set is too small. OK. So great. Let me show you the experiment.
We did 64 data sets, so a large number of data sets: categorical data sets, real-valued data sets, regression data sets, synthetic data sets. The number of features ranged between three and seven hundred eighty-four, and the number of classes ranged from two upward. We basically just downloaded a whole repository, OK, and generated synthetic data sets too. All right. Now, when we had a large Rashomon ratio, these are the kinds of results we got. Lots of different machine learning methods all perform very similarly. This is for different data sets here, with five different machine learning methods. They're all performing very similarly, and they're all generalising between training and test.
OK, so this is for large Rashomon ratios. For small Rashomon ratios, that's not always what we got. For small Rashomon ratios, sometimes the accuracy would be all over the place, with different methods performing differently. And sometimes they wouldn't generalise as well between training and test as what you're seeing over here with the large Rashomon ratios. That didn't always happen with small Rashomon ratios; there was a variety of different results. We could also sometimes get cases where everything generalised really well. But our theory, luckily, really applies to large Rashomon sets. So we're making a conclusion that, you know, if the Rashomon set is large, then you get these nice properties.
Cynthia, how do you define a Rashomon set being large or small? Are you taking the data sets and then looking at the quantities of the Rashomon sets, and that's how you define large or small?
Yeah. So, believe it or not, the value is around 10 to the negative thirty-seven. These values came from importance sampling with the set of decision trees of depth seven. So these are actually large values, and anything that was like 10 to the negative 38 or negative 39 or below is like a small Rashomon ratio. Yeah. And we were just looking relative to, you know, what we would get on these different data sets.
Cynthia, another question: do you think it is likely that vision and speech data sets allow for function spaces with large Rashomon sets? That is a great question, and I think they do, but there is a lot I could say about that. So, I work in interpretable machine learning, and we've been trying to design interpretable models for computer vision for a long time, and we've been able to create models for computer vision that are interpretable.
But the definition of interpretability is different for computer vision than it is for other types of problems. So, for example, you would never want to do a decision tree on pixels. That's not interpretable, right? What you'd want to do is maybe case-based reasoning, where you say this part of the image looks like this part of this other image.
And for those types of problems, we've been able to design interpretable neural networks that are constrained to reason in this way, but still attain the same level of accuracy as regular black box neural networks. Mm hmm. And so I think the only reason we're able to do this is because the Rashomon set permits it; the Rashomon set is large enough to permit it. And so I can say that about computer vision.
I obviously can't say that about every possible application of machine learning to every possible problem. I can only talk about the problems I've worked on. But I even started working in materials science, which is a super complicated domain, and even there we were able to find models that were interpretable to our human materials science colleagues and that were as accurate as, or more accurate than, the black boxes we could construct.
And I say more accurate because sometimes the insight you get from the interpretability actually allows you to boost accuracy. So I think I hope that answers your question. Mm hmm. Thanks. Yeah, that was a really good question. Thank you for asking.
OK. Great, so now, as I mentioned, you really can't measure the size of the Rashomon set in practice, but that's OK, because we got a lot of information out of these experiments. In particular, if the Rashomon ratio is large, we found that all the methods performed similarly and generalised as well. Now, if the methods perform differently, it's likely to be a small Rashomon ratio.
Now, we're not completely sure about this yet, but we think it is a viable, possible explanation for what's going on. And you know, it does explain why I and a lot of other people have found that algorithms perform similarly across many problems. It's because there's a large Rashomon ratio. Yeah. So why do simple models perform well? It's possibly because there's a large Rashomon ratio.
OK. We found something else besides the results that I just showed you that we were really surprised about, and we found this on every single data set that we examined. What we found is something called the Rashomon curve. I'll show you a cartoon of it before I actually show you the real thing. It's a plot of the Rashomon ratio versus the empirical risk. OK, so let's say that you take a hierarchy of hypothesis spaces, so we have the simplest ones up to the more complex ones.
So this is like decision trees of depth one, two, three, four, five, six and seven, like that. OK. So, embedded spaces. Now, when you go down this curve here, when you add more complexity, what should happen to these quantities? Well, as you add more complexity, the best empirical risk for each function class goes down, because you have more models, so you can fit better. So as we increase from here to here, we expect to go this way.
What about the Rashomon ratio? Well. The numerator goes up because you have more good models, but the denominator goes up as well because you have more models. So what, what happens? And as it actually turns out, oh, sorry, I just put the Rashomon ratio there for you again, the fraction of models that are good, right? Both the numerator and the denominator go up because you have more models in both cases.
As it turns out, it goes down. The denominator increases much more quickly than the numerator. So what happens is that you take your simplest function class, and then you increase complexity a little bit, and the empirical risk goes down while the Rashomon ratio tends to stay kind of constant. But then all of a sudden it just, like, nosedives.
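The shape described above can be reproduced on a toy hierarchy. The sketch below is not the decision-tree experiment from the talk. As an assumed stand-in, class k is every boolean function of the first k binary features, so the classes are nested and small enough to enumerate exactly. The empirical Rashomon set for each class is taken to be the functions within theta of that class's best empirical risk, which is an assumption about the definition. Names are hypothetical.

```python
from itertools import product


def rashomon_curve(X, y, max_k, theta):
    """Return a list of (k, best empirical risk, Rashomon ratio) for k = 1..max_k.

    Class k holds all 2 ** (2 ** k) boolean functions of the first k binary
    features, so keep max_k small (4 already means 65,536 functions).
    """
    points = []
    for k in range(1, max_k + 1):
        inputs = list(product((0, 1), repeat=k))  # every possible input pattern
        risks = []
        for outputs in product((0, 1), repeat=len(inputs)):  # every truth table
            table = dict(zip(inputs, outputs))
            errors = sum(table[tuple(x[:k])] != label for x, label in zip(X, y))
            risks.append(errors / len(y))
        best = min(risks)
        # Fraction of the class that is within theta of the class's best risk.
        ratio = sum(r <= best + theta for r in risks) / len(risks)
        points.append((k, best, ratio))
    return points


# Made-up data: three binary features, label is their parity.
X = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
y = [(a + b + c) % 2 for a, b, c in X]
for k, best_risk, ratio in rashomon_curve(X, y, max_k=3, theta=0.0):
    print(k, best_risk, ratio)
```

Even on this toy, the numerator (good functions) grows far more slowly than the denominator (all functions) once the class becomes rich enough to fit the data, which is the mechanism behind the nosedive.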
And, you know, sometimes you overfit a little bit, so you can see a little bit of, you know, maybe it's going to sway a little bit, but most of the time it goes down so quickly that you can't even see that. And, you know, we were kind of surprised to see this, because we saw it on every single data set that we examined. We were not expecting to see anything like this.
And sometimes you see the whole curve, like sometimes you see it go over and down, but sometimes it just goes down. You don't even see this part, because if the simpler models already perform pretty well, it just kind of goes down. OK, so yeah, we saw it on all kinds of different data sets. Like I said, sometimes you see the little curve, and sometimes you just see parts of it, like that vertical part.
And there's always, you know, some kind of turning point up here, or else it's at the top. And yet we saw it on every single data set that we experimented with. OK, so that's what happens on the training set. What about the test set? Luckily, statistical learning theory tells us what the difference is between the training and test results. OK, so the generalisation is better for smaller function classes than it is for larger function classes.
So you would expect to overfit when you have a really big function class, and then your true risk is going to become worse, right? So what the theory tells us is that we should really be looking around this elbow, what we call the Rashomon elbow, because this is the simplest function class that describes the data well, that has low empirical risk. So this elbow seems to be a really good choice for model selection.
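As a rough illustration of "the simplest function class that describes the data well", here is a small Python sketch. It is a stand-in heuristic, not the procedure used in the talk's experiments: it assumes you already have one training risk per function class, ordered from simplest to most complex, and it picks the first class whose risk is within an assumed tolerance of the lowest risk seen. The function name, the tolerance, and the numbers are hypothetical.

```python
def pick_elbow(train_risks, tol=0.01):
    """Index of the simplest class whose risk is within tol of the best risk.

    `train_risks` must be ordered from the simplest function class to the
    most complex one. `tol` is an assumed tolerance for "low empirical risk".
    """
    if not train_risks:
        raise ValueError("train_risks must be non-empty")
    best = min(train_risks)
    for index, risk in enumerate(train_risks):
        if risk <= best + tol:
            return index  # first (simplest) class that is good enough
    raise AssertionError("unreachable: the best class always qualifies")


# Hypothetical training risks for, say, decision trees of depth 1 through 6.
risks_by_depth = [0.30, 0.22, 0.15, 0.148, 0.147, 0.147]
elbow = pick_elbow(risks_by_depth, tol=0.01)
print("elbow at depth", elbow + 1)
```

With these made-up numbers the risk stops improving meaningfully at depth 3, so that is where the sketch places the elbow; deeper trees only add complexity.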
So let me show you the results. I'll show you the Rashomon curves for all 64 data sets. All right. I'm going to zoom in to some of these just to show you what's going on here. And I just want to point out that we're averaging over 10 folds to plot these, both for the training and test empirical risks and for the Rashomon ratio.
And so what you should see is it going across and down, or just down, for all of these data sets. OK, so let me zoom in a little bit. So sometimes the theory was insightful. Sometimes you really did see the elbow being the best model, and that really worked. But as you know, with randomness, statistical learning theory is all probabilistic, so sometimes it doesn't really work. Sometimes everything just always generalised, and the training and test points were right on top of each other.
And then sometimes we never generalise, in which case you're just seeing these big uncertainty bands, right? The big generalisation gaps between training and test. But regardless of which of these three situations you're in, the elbow just seems to be a good choice for model selection because, again, it's the simplest function class that describes the data well. And in no case did the elbow turn out to be a really bad choice, right?
Yes, so the elbow model always seems to be a good choice for model selection. So that makes you wonder where you are relative to the elbow, right? Because in real problems, you don't actually see the whole curve. You just pick your function class and run your method, and you don't know where you are on the curve.
You could be anywhere on this curve. So you might want to figure out where you are on the curve, to figure out whether you're close to the elbow. And remember, you usually can't measure any point on this curve, because the curve requires the Rashomon ratio, which is the fraction of good models, and you can't measure that. OK, so what can you do? Well, if you are in this part of the curve, then models with different complexity levels perform differently.
So if you run a whole bunch of different machine learning methods with different levels of complexity and they all perform differently, then you're probably on this part of the curve here. And in that case, you probably want to increase your complexity and go to the elbow. Whereas if you're in this part of the curve, well, then it doesn't matter which machine learning method you choose: they all perform very, very similarly.
You get very similar empirical risk. And so in that case, you might want to try to make the model simpler so you can go up toward the elbow, because you probably won't lose performance if you make the model simpler. So I had been thinking that this might explain some of the things that I and others have been observing across problems and across data. There are some problems, like ImageNet, where the field has been designing more and more complicated models, and it reduces error.
So maybe we're still in this part of the curve, right? On the other hand, think about problems like MNIST, where no matter which method you use, you get 100 percent accuracy. In that case, we may be on this part of the curve, and we could use simpler models and still get 100 percent accuracy. And those simpler models might have other properties: they might generalise better outside of MNIST, and they might be more interpretable.
And then there are these kinds of problems, the kind I usually work on, where it doesn't matter which machine learning method you pick. They all just perform similarly. Like for re-arrest, you know, recidivism prediction, I feel like we're in this part of the Rashomon curve, because with recidivism prediction you can get a really simple model that predicts just as well as your super complicated model.
And there's just an inherent level of noise. If you try to get more accurate, you'll just overfit, basically. Yeah, so for these types of problems, I think we probably want to be walking up the curve, reducing complexity to get a simpler model that is interpretable but still maintains your level of accuracy. Just walking up toward the elbow there.
OK, so what I've gotten to is an easy check, a simple check for the possible presence of a simpler yet accurate model, which is that you pick several of your favourite machine learning methods and you run them all on the data set. OK. If they all perform differently, your model class is maybe too small to include the elbow solution. You can get a little bit more complex.
You can get a little bit more complex. So, yeah, use a more complex model class if other machine learning methods perform similarly. Your model class might be a little bit too big than you need, in which case you can try to find specialised models that will move you up. You know, they have the special properties like interpretability just decreased your complexity.
OK, so great. So, all right, I've defined my condition, which is that the Rashomon set is large, and I've shown you that you don't need to calculate the Rashomon ratio. You can just try lots of different machine learning methods, and that gives you a sense of whether simpler solutions might exist. And now a lot of people don't believe me about this or they're not interested, and that's fine.
But sometimes it can be kind of silly, so I want to tell you a story that happened a couple of summers ago. I found out about this competition called the Explainable Machine Learning Challenge, and my group decided we had to enter it. I do a lot of data science competitions. I actually coach Duke's data science competition team, where we enter data science competitions.
And then this thing came out and we were like, oh, we've got to do it. But the goal of the competition was to create a black box and explain it. And so we got the data set. It was a nice big data set from FICO on loan defaults. The dataset had thousands of rows. Each one was a person with their whole credit history, and we had to decide whether or not they would default on their loan. And we looked at it and we thought, you know, this looks like it has a good data representation.
And I thought, could I be wrong? Could it be a problem with a good data representation where you need a black box? And so I said to my students, look, I don't know about this competition. Just try running a bunch of different machine learning methods on the data set and see whether they all perform the same. So a day or so later, they came back and they said, yep, all the methods are performing the same.
And at that point, we pretty much knew that the dataset had a large Rashomon set. So we said, OK, we think we can construct an interpretable model for this dataset. And so we had a debate: should we follow the competition rules, should we create a black box and explain it?
Or should we actually try to create an inherently interpretable model? After about two seconds of debate, we decided that for a problem as important as credit risk, we should create an inherently interpretable model. So we did. We created a globally interpretable model with a beautiful visualisation tool, and it had the same accuracy as the best neural network that we could construct.
So in fact, it's all live. You can actually play with it. You can go to the Duke data science website, which is just running on the Duke servers, and you can play around with the FICO dataset and our model. I'm just showing you a snapshot of it. I don't want to bring up the whole thing, but basically it had a bunch of subscales, and you could click on the subscales and you could get points for different things.
Like, for instance, this is the delinquency subscore. And it's a set of sparse logistic regression models, essentially. So for instance, you'd get a point for your percent of trades being never delinquent. For this person, actually, their trades were kind of delinquent. That's why they got a point. The number of months since the most recent delinquency, they get points for that.
And so, yeah, you just add up the points, and each set of points would translate into a little score, and you'd add up the scores. It was very nice. It was a nice decomposable model, made of little sparse logistic regression type models. And so we sent this in to the competition, wondering what the judges would think of it. Because I thought, they're going to have no idea how to judge this, because it's an inherently interpretable model.
And I was right. They had no idea how to judge this, and we totally bombed. We did absolutely terribly. We didn't even place. Actually, what happened was that they didn't allow any of the judges to play with the visualisations that people had constructed. So for every team that created a visualisation tool for their model, the judges didn't get to play with it.
So that put us at a major disadvantage. But luckily, the judges realised that their judging criteria weren't very good, and they saw value in what we did. And so they gave us an award. They actually created a little award for us. They created the FICO Recognition Award, acknowledging our submission for going above and beyond expectations with a fully transparent global model and a user-friendly dashboard.
And so I was really excited about this, and I thought, OK, I'll write a paper about it, and we'll send it in to a special issue of a journal on decision making. And I was told to email the guest editor of the special issue to see if the paper was appropriate. So I emailed the person: Dear fancy esteemed professor at fancy Stanford University, we have this paper, and we don't know whether it fits into the scope of the special issue.
It's not a traditional methodology paper. It's an analysis of this competition dataset, including a globally interpretable machine learning model that didn't lose accuracy over the black boxes. It won this award. What do you think? And he sent me back this email saying: Dear Cynthia, thanks for reaching out. This is an interesting paper, but I'm afraid it's not a good fit for the special issue. It's also related to my own recent work on explainability of neural nets.
Is the FICO data still available? If so, could you share it? And I was like, oh my gosh. You know, I send the guy a paper saying, hey, you don't need a black box for this dataset, and he sends me back an email saying, I don't care about your paper, but can you send me the data so I can create a black box for it and explain it? And so that's unfortunately the state of where things are at the moment.
OK, so to summarise, I have defined a condition under which a simple yet accurate model is likely to exist, which is that the Rashomon set is large. I showed a simple check for large Rashomon sets, which is to run many different machine learning methods on your data to see if they all perform similarly. If they do, there's a good chance that you have a large Rashomon set and that you can find a simpler model.
I introduced the notion of Rashomon curves, which we found to have that characteristic pattern for every dataset we examined. And so, yeah, now that we know that interpretable yet accurate models tend to exist, we can go find them. And that's what my lab works on: finding these models. So finally, at the end of the talk, I get to introduce myself.
So, yeah, I lead the Prediction Analysis Lab. Most of my time is dedicated to the problem of optimal decision trees, so finding really tiny little if-then rule-based models like the CORELS model I showed you earlier for recidivism. We have the fastest code right now, by about three orders of magnitude, for optimal decision trees.
I also work on medical scoring systems, which we've used for a lot of medical applications. This is a model called the 2HELPS2B score, which is used in intensive care units by doctors to help predict whether a patient will have a seizure. And that helps the doctors monitor the patient, prevent brain damage, and save lives. I also work on interpretable neural networks for computer vision.
And as I mentioned earlier, we've shown that you can create interpretable models for computer vision that have the same accuracy as black boxes. And we're using them now in a collaboration with radiologists to help with reading mammograms, to provide a computer-aided decision. Not just an automated decision, right? It's computer aided rather than automated.
I also work on data visualisation and dimension reduction, where we're trying to project high-dimensional data onto low dimensions, onto two dimensions, so that you can understand the high-dimensional structure in the data. So we're trying to preserve as much of the high-dimensional structure as possible when projecting onto 2-D.
And then I also work as one of three professors on the Almost Matching Exactly project, where we're trying to match units almost exactly so that we can do interpretable causal inference. And then finally, the last one is understanding the set of good models and the importance of variables, and you heard about one of the projects in this category today.
And then finally, as I mentioned, I coach the Duke data science competition team, where we wrote automated computer poetry, and we do image super-resolution. This year, we were competing in a citation labelling competition, which was really fun. And yeah, I love competing in data science competitions, and I've been coaching students for years to do that. OK, thank you very much.

Thanks a lot for this very thought-provoking talk, Cynthia. Yeah, we have some minutes for questions.
Judith Rousseau was asking something. Do you want to ask it yourself?

Sure. So are there some situations where you would be not quite sure about the accuracy or relevance of your estimates? I'm not sure my question makes sense, but do you trust them?

Yeah, we actually don't trust our Rashomon curve estimates that much. We're only trusting them to determine whether the Rashomon set is large, because it's very difficult to estimate the sizes of really small Rashomon sets.
So if our estimates are that the Rashomon set is small, then we just know it's small. We don't really know what its value is. And luckily, like I said, in practice you never really need to construct the Rashomon curve or the Rashomon ratio.
Because we're just gaining the insight from it, this important piece of information that if you try a lot of different machine learning methods and they all perform similarly, then you probably have a large Rashomon set. And that's all we really needed to glean from those estimates.
That makes sense. Thanks.

OK. So you mentioned at some point the fact that the Rashomon ratio is not the same as local minima, and I understand that the reason is that you may have several small local minima that would give a large Rashomon set, or you may have a discrete hypothesis space.
So the local minima narrative wouldn't make sense there. But otherwise, suppose that situation doesn't exist. Then what would be the relation between the usual narrative in deep learning, for example, about the local minima and their good properties, and there being large Rashomon sets?

So if you have a flat minimum, then you do have a large Rashomon set, right?
Right. Because you would have this flat area, the flat minimum, and then you'd be able to put a ball in there. It's just that we can have a large Rashomon set without having a flat minimum.

Well, I see. Yes. And I also wanted to ask about the name Rashomon.

Oh yeah. So that name came from Leo Breiman, who got it from the movie Rashomon. So there's a Japanese movie. I haven't watched it yet. I've been meaning to watch it, but I have children.
And so it's kind of hard, you know. You don't want to watch that movie about violent stuff with the kids. All right, so I haven't watched it. But it's a movie about a violent crime that occurred. There are four different perspectives on the crime, and in the end, you end up thinking that there's no real truth. There are just a lot of different ways of seeing the same thing.
And so it's the same thing with models, right? There's no true model. There's no underlying truth, right? We just have a finite dataset. So there's no truth. There are just a lot of models that perform well, just a lot of good explanations for what's actually happened. And so the Rashomon set is the set of good explanations for the data.
Mm-hmm. You know, I remember this paper by Breiman. He mentions Rashomon and also Occam, and I wonder whether the Rashomon and Occam perspectives on model complexity are related or equivalent, or whether they are pointing to different aspects. What do you think about that?

Well, I think that Rashomon enables Occam, right?
Because large Rashomon sets say that you can find a simpler model that explains the data well.

Mm-hmm. Yes.

Yeah, I'm one of the people who was lucky enough to get a chance to meet Leo Breiman. Although the time I met him, he walked up to me during a NIPS poster session, when I had been trying to prove whether or not AdaBoost maximises the margin.
And he said, well, I already proved that. If you want to have a real thesis, you could do something else. But, you know, I ended up becoming friends with him, and I remember him waving to me at the end of the conference. And I did manage to actually prove the theorem. In the end, I did prove that AdaBoost does not maximise its margin. But yeah, it was interesting getting a chance to meet him.
Yeah, just a really outspoken guy who's done amazing things because of his work in industry. And, you know, just kind of going out in the real world and understanding the value of things like interpretability, creating decision trees, and the value that they created for people.

Hmm. OK. So I think there are no more questions. Thanks a lot for your time and for your talk.

Thanks. I wish I could meet all of you in person, but maybe one day. Yeah. Thank you.
Bye bye. Thank you. Thanks, Michael. Thank you, bye. Yeah, thanks a lot.
