Let me show you how to be a good Bayesian... how they can help with sampling Bayesian models, and their similarity with normalizing flows that we discussed in episode 98. Arto also introduces PreliZ, a tool for prior elicitation, and highlights its benefits in simplifying the process of setting priors, thus improving the accuracy of our models. When Arto is not solving mathematical equations, you'll find him cycling or around a good board game. This is Learning Bayesian Statistics, episode 103,
recorded February 15, 2024. Welcome to Learning Bayesian Statistics, a podcast about Bayesian inference, the methods, the projects, and the people who make it possible. I'm your host, Alex Andorra. You can follow me on Twitter at alex_andorra, like the country. For any info about the show, learnbayesstats.com is the place to be: show notes, becoming a corporate sponsor, unlocking Bayesian merch, supporting the show on Patreon. Everything is in there. That's learnbayesstats.com.
If you're interested in one-on-one mentorship, online courses, or statistical consulting, feel free to reach out and book a call at topmate.io slash alex_andorra. See you around, folks, and best wishes to you all. Arto Klami, welcome to Learning Bayesian Statistics. Thank you. You're welcome. How was my Finnish pronunciation? Oh, I think that was excellent. For people who don't have the video, I don't think that was true. So thanks a lot for taking the time, Arto.
I'm really happy to have you on the show. I've had a lot of questions for you for a long time, and the longer we postponed the episode, the more questions there were. So I'm gonna do my best to not take three hours of your time. And let's start by... maybe defining the work you're doing nowadays, and, well, how did you end up working on this? Yes, sure. So I personally identify as a machine learning researcher. So I do machine learning research, but very much from a Bayesian perspective.
So my original background is in computer science. I'm essentially a self-educated statistician, in the sense that I've never really properly studied statistics, well, except for a few courses here and there. But I've been building models and algorithms, building on the Bayesian principles, for addressing various kinds of machine learning problems. So you're basically like a self-taught statistician through learning, let's say. More or less, yes.
I think the first things I started doing, with anything that had to do with Bayesian statistics, were pretty much already going off the deep end and trying to learn posterior inference for fairly complicated models, even actually non-parametric models in some ways. Yeah, we're going to dive a bit into that. Before that, can you tell us the topics you are particularly focusing on within that umbrella of topics you've named?
Yes, absolutely. So I think I actually have a few somewhat distinct areas of interest. So on one hand, I'm working really on the kind of core inference problem. So how do we computationally efficiently, accurately enough approximate the posterior distributions? Recently, we've been especially working on inference algorithms that build on concepts from Riemannian geometry.
So we're trying to really account for the actual manifold induced by this posterior distribution and try to somehow utilize these concepts to speed up inference. So that's one very technical aspect. Then the other main theme, on the kind of Bayesian side, is on priors. So we've been working on prior elicitation. So how do we actually go about specifying the prior distributions, and ideally maybe not even specifying them ourselves?
So how would we extract that knowledge from a domain expert who doesn't necessarily even have any sort of statistical training? And how do we flexibly represent their true beliefs and then encode them as part of a model? That's maybe the main kind of technical aspects there. Yeah. Yeah. No, super fun. And we're definitely going to dive into those two aspects a bit later in the show. I'm really interested in that.
Before that, do you remember how you first got introduced to Bayesian inference, actually, and also why it stuck with you? Yeah, like I said, I'm in some sense self-trained. I mean, coming from a computer science background, more or less, sometime during my PhD, I was working in a research group that was led by Samuel Kaski. When I joined the group, we were working on neural networks of the kind that people were interested in back then. That was like 20 years ago.
So we were working on things like self -organizing maps and these kind of methods. And then we started working on applications where we really bumped into the kind of small sample size problems. So looking at... DNA microarray data that was kind of tens of thousands of dimensions and medical applications with 20 samples. So we essentially figured out that we're gonna need to take the kind of uncertainty into account properly.
We started working on the Bayesian modeling side of these, and one of the very first things I was doing was trying to create Bayesian versions of some of these classical analysis methods, especially canonical correlation analysis. The original derivation there is like an information-theoretic formulation. So I kind of dove directly into this: let's do Bayesian versions of models. But I actually do remember that around the same time I also took a course, a course by Aki Vehtari.
He's one of the authors of the Gelman et al. book, Bayesian Data Analysis. I think the first version of the book had been released just before that. So Aki was giving a course where he was teaching based on that book. And I think that's the first real official contact on trying to understand the actual details behind the principles. Yeah, and actually I'm pretty sure listeners are familiar with Aki. He's been on the show already, so I'll link to the episode, of course, where Aki was.
And yeah, for sure. I also recommend going through that episode's show notes, for people who are interested in, well, starting to learn about Bayes stuff and things like that. Something I'm wondering from what you just explained is, so you define yourself as a machine learning researcher, right? And you work in artificial intelligence too. But there is this interaction with the Bayesian framework.
How does that framework underpin your research in statistical machine learning and artificial intelligence? How does that all combine? Yeah. Well, that's a broad topic. There's of course a lot in that intersection. I personally do view all learning problems in some sense from a Bayesian perspective.
I mean, no matter whether it's a very simple fitting-a-linear-regression type of problem or whether it's figuring out the parameters of a neural network with 1 billion parameters, it's ultimately still a statistical inference problem. I mean, in most of the cases, I'm quite confident that we can't figure out the parameters exactly. We need to somehow quantify the uncertainty. I'm not really aware of any other kind of principled way of doing it.
So I would just think about it that we're always doing Bayesian inference in some sense. But then there's the issue of how far can we go in practice? So it's going to be approximate. It's possibly going to be very crude approximations. But I would still view it through the lens of Bayesian statistics in my own work. And that's what I do when I teach my BSc students, for example.
I mean, not all of them explicitly formulate the learning algorithms from these perspectives, but we are still talking about what the relationship is, what we can assume about the algorithms, what we can assume about the results, and how it would relate to properly estimating everything exactly how it should be done. Yeah, okay, that's an interesting perspective. So basically putting that in that framework.
And that makes me think then: what do you believe the impact of Bayesian machine learning is on the broader field of AI? What does that bring to that field? Let's say it has a big effect. It has a very big impact, in the sense that it touches pretty much most of the stuff that is happening on the machine learning front, and hence also all the learning-based AI solutions.
I think a lot of people are thinking about it roughly the same way as I am: there is an underlying learning problem that we would ideally want to solve more or less following exactly the Bayesian principles. They just don't necessarily talk about it from this perspective. So you might be happy to write algorithms where all the justification for the choices you make comes from somewhere else.
But I think a lot of people are accepting the kind of probabilistic basis of these. So for instance, if you think about the objectives that people are optimizing in deep learning, they're all essentially likelihoods of some assumed probabilistic model. Most of the regularizers they are considering do have an interpretation as some kind of prior distribution.
I think a lot of people are, all the time, going deeper and deeper into actually explicitly thinking about it from these perspectives. So we have a lot of these deep learning type of approaches, variational autoencoders, Bayesian neural networks, various kinds of generative AI models, that are actually even explicitly formulated as probabilistic models with some sort of approximate inference scheme. So these things are two sides of the same coin.
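To make the correspondence concrete, here is the standard identity being alluded to, for the common case of a Gaussian prior (a textbook relation, not a quote from the episode):

```latex
% MAP estimation: minimizing (loss + regularizer) = maximizing an unnormalized posterior
-\log p(\theta \mid D) = \underbrace{-\log p(D \mid \theta)}_{\text{training loss}}
                          \underbrace{-\log p(\theta)}_{\text{regularizer}} + \text{const}

% A Gaussian prior p(\theta) = N(0, \sigma^2 I) gives exactly the L2 / weight-decay penalty
-\log p(\theta) = \frac{1}{2\sigma^2}\lVert\theta\rVert_2^2 + \text{const},
\qquad \lambda = \frac{1}{2\sigma^2}
```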
People are more and more thinking about them from the same perspective. Okay, yeah, that's super interesting. Actually, let's start diving into these topics from a more technical perspective. So you've mentioned the research and advances you are working on regarding Riemannian spaces. So I think it'd be super fun to talk about that, because we've never really talked about it on the show. So maybe can you give listeners a primer on what a Riemannian space is?
Why would you even care about that? And what you are doing in this regard, what your research is in this regard. Yes, let's try. I mean, this is a bit of a mathematical concept to talk about. But I mean, ultimately, if you think about most of the learning algorithms, so we are kind of thinking that there are some parameters that live in some space.
So essentially, without thinking about it, we just assume that it's a Euclidean space, in the sense that we can measure distances between two parameter values to say how similar they are. It doesn't matter which direction we go: if the distance is the same, we think that they are equally far away. Now, a Riemannian geometry is one that is curved in some sense. So we may be stretching the space in certain ways, and we'll be doing this stretching locally.
So what it actually means, for example, is that the shortest path between two possible values, maybe for example two parameter configurations, that if you start interpolating between two possible values for a parameter, it's going to be a shortest path in this Riemannian geometry, which is not necessarily a straight line in an underlying Euclidean space. So that's what the Riemannian geometry is in general. So it's kind of the tools and machinery we need to work with these kind of settings.
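In symbols, and this is just the standard textbook definition added for reference, a Riemannian metric assigns to every point θ a positive-definite matrix G(θ), and all lengths are measured locally through it:

```latex
% squared length of an infinitesimal step d\theta taken at the point \theta
ds^2 = d\theta^\top G(\theta)\, d\theta

% length of a curve \gamma(t), t \in [0, 1]; a geodesic is a curve minimizing this
L(\gamma) = \int_0^1 \sqrt{\dot\gamma(t)^\top G(\gamma(t))\, \dot\gamma(t)}\, dt

% Euclidean geometry is the special case G(\theta) = I, where geodesics are straight lines
```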
And now the relationship to statistical inference comes from trying to define a Riemannian space that has somehow nice characteristics. So maybe the concept most people might actually be aware of would be the Fisher information matrix, which characterizes the curvature induced by a particular probabilistic model.
So these tools then allow, for example, a very recent thing that we did, it's going to come out later this spring at AISTATS, an extension of the Laplace approximation to a Riemannian geometry. For those of you who know what the Laplace approximation is, it's essentially just fitting a normal distribution at the mode of a distribution.
But if we now fit the same normal distribution in a suitably chosen Riemannian space, we can actually also model the curvature of the posterior around the mode, and even how it stretches. So we get a more flexible approximation. We are still fitting a normal distribution; we're just doing it in a different space. Not sure how easy that was to follow, but at least maybe it gives some sort of an idea. Yeah, yeah, yeah. That was actually, I think, a pretty approachable introduction.
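For reference, here is a minimal sketch of the standard, Euclidean Laplace approximation that Arto starts from; the target density is a made-up example, and the Riemannian variant he describes fits the same Gaussian after warping the space with a chosen metric:

```python
# Minimal Euclidean Laplace approximation: find the mode, then fit N(mode, H^-1)
# using the Hessian H of the negative log-posterior at the mode.
import numpy as np
from scipy.optimize import minimize

def log_post(theta):
    # hypothetical banana-shaped 2-D target, purely for illustration
    x, y = theta
    return -0.5 * (x**2 / 4.0 + (y - 0.5 * x**2) ** 2)

neg = lambda t: -log_post(t)
res = minimize(neg, x0=np.zeros(2))               # 1) locate the posterior mode

eps = 1e-4                                        # 2) finite-difference Hessian at the mode
H = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        ei, ej = np.eye(2)[i] * eps, np.eye(2)[j] * eps
        H[i, j] = (neg(res.x + ei + ej) - neg(res.x + ei)
                   - neg(res.x + ej) + neg(res.x)) / eps**2

cov = np.linalg.inv(H)                            # 3) q(theta) = N(mode, H^-1)
print("mode:", res.x)
print("covariance:\n", cov)
```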
And so if I understood correctly, then you're gonna use these Riemannian approximations to come up with better algorithms. Is that what you do, and why you focus on Riemannian spaces? And yeah, if you can introduce that and tell us basically why it is interesting to look at geometry in these different ways, instead of the classical Euclidean way of thinking about geometry. Yeah, I think that's exactly what it is about.
So one other thing, maybe another perspective of thinking about it is that we've also been doing Markov chain Monte Carlo algorithms, so MCMC in these Riemannian spaces. And what we can achieve with those is that if you have, let's say, a posterior distribution, that has some sort of a narrow funnel, some very narrow area that extends far away in one corner of your parameter space.
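The canonical example of such a shape is Neal's funnel; this is my illustration of what Arto is describing, not an example he names:

```latex
v \sim \mathcal{N}(0,\, 3^2), \qquad x_i \mid v \sim \mathcal{N}\!\left(0,\, e^{v}\right), \quad i = 1, \dots, 9

% For very negative v the conditional scale e^{v/2} is tiny, so the posterior collapses into a
% narrow neck; a step size tuned for the wide part of the funnel overshoots badly down there.
```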
It's actually very difficult to get there with something like standard Hamiltonian Monte Carlo, but with the Riemannian methods we can kind of make these narrow funnels equally easy compared to the flatter areas. Now of course this may sound like a magic bullet that we should be doing all inference with these techniques. Of course it does come with certain computational challenges. So we do need to be, like I said, the shortest paths are no longer straight lines.
So we need numerical integration to follow the geodesic paths in these metrics and so on. So it's a bit of a compromise, of course. So they have very nice theoretical properties. We've been able to get them working also in practice in many cases so that they are kind of comparable with the current state of the art. But it's not always easy. Yeah, there is no free lunch. Yes. Yeah. Yeah. Do you have any resources about these?
Well, first the concepts of Riemannian spaces, and then the algorithms that you folks derived in your group using these Riemannian spaces, for people who are interested? Yeah, I wouldn't know, let's say, a very particular resource I would recommend on Riemannian geometry itself. It is actually a rather, let's say, mathematically involved topic. But regarding the specific methods, I think they are...
There are a couple of my recent papers: so we have this Laplace approximation coming out at AISTATS this year; the MCMC sampler we had, I think, two years ago at AISTATS, similarly, the first MCMC method building on these; and then last year one paper in Transactions on Machine Learning Research. I think they are more or less accessible. Let's definitely link to those papers if you can in the show notes, because I'm personally curious about it, but also I think listeners will be.
It sounds from what you're saying that this idea of doing algorithms in this Riemannian space is somewhat recent. Am I right? And why would it appear now? Why would it become interesting now? Well, it's not actually that recent. I think the basic principle goes back, I don't know, maybe 20 years or so. I think the main reason why we've been working on this right now is that we've been able to resolve some of the computational challenges.
So the fundamental problem with these methods is always this numerical integration of following the shortest paths. Depending on the algorithm we need it for different reasons, but we always need to do it, and it usually requires operations like inversion of a metric tensor, which has the dimensionality of the parameter space. So we came up with a particular metric that happens to have a computationally efficient inverse.
So there are these kinds of concrete algorithmic techniques that bring the computational cost to a level where it's no longer notably more expensive than doing standard Euclidean methods. So we can, for example, scale them to Bayesian neural networks. That's one of the application cases we are looking at: really very high-dimensional problems, but still being able to do some of these Riemannian techniques, or approximations of them.
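As an illustration of how a metric can have a cheap inverse, here is a sketch assuming a rank-one structure, G = I + α² g gᵀ with g a gradient vector, which is loosely inspired by the Monge-patch metric used in this line of work rather than quoted from the episode. The Sherman-Morrison identity then gives G⁻¹v in O(d) instead of the O(d³) of a dense inverse:

```python
# Sherman-Morrison: (I + a^2 g g^T)^-1 = I - (a^2 / (1 + a^2 g.g)) g g^T,
# so G^-1 v never requires forming or factorizing the d x d metric.
import numpy as np

d = 10_000
alpha = 0.5
rng = np.random.default_rng(0)
g = rng.standard_normal(d)            # stand-in for grad log p(theta)

def metric_inverse_times(v):
    """Compute G^{-1} v for G = I + alpha^2 g g^T without materializing G."""
    coef = alpha**2 / (1.0 + alpha**2 * (g @ g))
    return v - coef * g * (g @ v)

v = rng.standard_normal(d)
print(metric_inverse_times(v)[:3])    # O(d) time and memory per solve
```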
That was going to be my next question. In which cases are these approximations interesting? In which cases would you recommend listeners to actually invest time to use these techniques, because they have a better chance of working than the classic Hamiltonian Monte Carlo samplers that are the default in most probabilistic programming languages? Yeah, I think the easy answer is that when the inference problem is hard.
So essentially one very practical way would be that you realize you can't really get Hamiltonian Monte Carlo to explore the space, the posterior, properly. It may be difficult to find out that this is happening. Of course, if you're never visiting a certain corner, you wouldn't actually know.
But if you have some sort of reason to believe that you really are dealing with such a complex posterior, then I'm kind of willing to spend a bit of extra computation to be careful, so that I really try to cover every corner there is.
Another example is that we realized, in the scope of these Bayesian neural networks, that there are certain kinds of, well, certain kinds of scenarios where we can show that if you do inference with too simple methods, so something in the Euclidean metric with a standard Langevin dynamics type of thing, what we actually see is that if you switch to using better prior distributions in your model, you don't actually see an advantage
of those unless you at the same time switch to using an inference algorithm that is able to handle the extra complexity. So if you have, for example, heavy-tailed spike-and-slab type of priors in the neural network, you just fail to get any benefit from these better priors if you don't pay a bit more attention to how you do the inference. Okay, super interesting.
And also, it seems it's also quite interesting to look at that when you have, well, or when you suspect that you have, multimodal posteriors. Yes, well yeah, multimodal posteriors are interesting.
We haven't specifically studied that question. We have actually thought about some ideas of creating metrics that would specifically encourage exploring the different modes, but we haven't done that concretely, so we're now still focusing on these kind of narrow, thin areas of posteriors and how you can reach those. Okay. And do you know of normalizing flows? Sure, yes. So yeah, we've had Marylou Gabrié on the show recently.
It was episode 98. And so she's working a lot on these normalizing flows and the idea of assisting MCMC sampling with these machine learning methods. And it's amazing. It can sound somewhat similar to what you do in your group. And so for listeners, could you explain the difference between the two ideas, and maybe also the use cases that both apply to? Yeah, I think you're absolutely right. So they are very closely related.
So there is, for example, the basic idea of neural transport, which uses normalizing flows for essentially transforming the parameter space in a suitable non-linear way and then running standard Euclidean Hamiltonian Monte Carlo. It can actually be proven, I think it is in the original paper as well, that it is mathematically equivalent to conducting Riemannian inference in a suitable metric.
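The mechanics behind that pre-transformation view, written out as the standard change-of-variables identity (added for context, not quoted from the paper):

```latex
% If x = f(z) with f an invertible (flow) map, HMC is run on the pulled-back density
\log \tilde p(z) = \log p\big(f(z)\big) + \log \left|\det J_f(z)\right|

% and samples z^{(s)} are pushed through f to give x^{(s)} = f(z^{(s)}) \sim p(x).
% A well-chosen f flattens the geometry, which is the same job a Riemannian metric does.
```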
So I would say that it's a complementary approach to solving exactly the same problem. So you have a way of warping your parameter space in a flexible way: you either do it through a metric, or you do it as a pre-transformation. So there's a lot of similarities. It's also the computation, in some sense: think about mapping a sample through a normalizing flow.
It's actually very close to what we do with the Riemannian Laplace approximation: you take a sample and you start propagating it through some sort of transformation. It's just whether it's defined through a metric or as a flow. So yes, they are very close. So now the question is: when should I be using one of these? I'm afraid I don't really have an answer,
in the sense that, I mean, there are computational properties: let's say, for example, if you work with flows, you do need to pre-train them, so you do need to train some sort of flow to be able to use it in certain applications, so it comes with some pre-training cost. Quite likely, when you're actually using it, it's going to be faster than working in a Riemannian metric, where you need to invert some metric tensors and so on. So there are these kinds of technical differences.
Then I think the bigger question is, of course, if we go to really challenging problems, for example very high dimensions, which of these methods actually works well there. For that I don't quite have an answer, in the sense that I wouldn't dare to say, or even speculate, which one; I might be missing some kind of obvious limitation of one of the approaches if I try to extrapolate too far from what we've actually tried in practice.
Yeah, that's what I was going to say. It's also that these methods are really at the frontier of the science. So I guess we're lacking, for now, the practical cases, right? And probably in a few years we'll have more ideas about these and when one is more appropriate than another. But for now, I guess we have to try those algorithms and see what we get back. And so actually, what if people want to try these Riemannian-based algorithms?
Do you already have packages that we can link to, that people can try and plug their own model into? Yes and no. So we have released open-source code with each of the research papers. So there is a reference implementation that can be used. We have internally been working a bit towards integrating these into the proper open ecosystems that would, for example, make model specification easy. It's not quite there yet.
So one particular challenge is that many of the environments don't actually have all the support functionality you need for the Riemannian methods. They're essentially simplifying some things by directly encoding the assumption that the shortest path is a linear interpolation, a straight line. So you need a bit of extra machinery for the most established libraries.
There are some libraries, I believe, that are actually making it fairly easy to do kind of plug-and-play Riemannian metrics. I don't remember the names right now, but that's where we've been planning on putting in the algorithms; they're not really there yet. Hmm, OK, I see. Yeah, definitely that would be, I guess, super, super interesting. If by the time of release you see something that people could try, we'll definitely link to that, because I think listeners will be curious.
And I'm definitely super curious to try that. Any new stuff like that, you'd like to try and see what you can do with it. It's always super interesting. And I've already seen some very interesting experiments done with normalizing flows, especially Bayeux by Colin Carroll and other people. Colin Carroll is one of the PyMC developers also.
And yeah, now you can use Bayeux to take any JAX-able model, you plug that into it, and you can use the flowMC algorithm to sample your JAX-able PyMC model. So that's really super cool. And I'm really looking forward to more experiments like that, to see, well, okay, what can we do with those algorithms? Where can we push them, to what extent, to what degree, where do they fall down? That's really super interesting, at least for me, because I'm not a mathematician.
So when I see that, I find that super... like, I love the idea, because basically the idea is somewhat simple. It's like, okay, we have that problem when we think about geometry that way, because then the geometry becomes a funnel, for instance, as you were saying. And then sampling at the bottom of the funnel is just super hard in the way we do it right now, because of the super small distances. What if we change the definition of distance?
What if we change the definition of geometry, basically, which is this idea of, OK, let's switch to a Riemannian space. And when we do that, then, well, the funnel disappears, and it just becomes something easier. It's like going beyond the idea of the centered versus non-centered parameterization, for instance, when you do that in a model, right? But it's going big with that, because it's more general. So I love that idea.
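Here is a hedged sketch of the reparameterization Alex is referring to, on a Neal's-funnel-style hierarchy in PyMC; the variable names are illustrative, and the API is assumed to match recent PyMC versions:

```python
import pymc as pm

# Centered parameterization: the scale of x depends directly on v, producing the funnel.
with pm.Model() as centered:
    v = pm.Normal("v", 0.0, 3.0)
    x = pm.Normal("x", 0.0, pm.math.exp(v / 2), shape=9)

# Non-centered parameterization: sample standardized x_raw, then rescale deterministically.
# Same joint distribution, much friendlier geometry for HMC/NUTS.
with pm.Model() as non_centered:
    v = pm.Normal("v", 0.0, 3.0)
    x_raw = pm.Normal("x_raw", 0.0, 1.0, shape=9)
    x = pm.Deterministic("x", x_raw * pm.math.exp(v / 2))

with non_centered:
    idata = pm.sample()  # default NUTS now explores the neck of the funnel far better
```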
I understand it, but I cannot really read the math and be like, oh, OK, I see what that means. So I have to see the model, see what I can do and where I can push it, and then I get a better understanding of what that entails. Yeah, I think you gave a much better summary of what it is doing than I did. So good for that. I mean, you are actually touching on something there. So one point is making the algorithms available so that everyone can try them out.
But then there's also the other aspect that we need to worry about, which is the proper evaluation of what they're doing. I mean, of course, in most of the papers, when you release a new algorithm, you need to emphasize things like, in our case, computational efficiency. And you do demonstrate it, maybe for example by quite explicitly showing that with these very strong funnels it does work better.
But now the question is, of course, how reliable these things are if used in a black-box manner, so that someone just runs them on their favorite model. And one of the challenges we realized is that it's actually very hard to evaluate how well an algorithm is working in an extremely difficult case. Because there is no baseline. I mean, in some of the cases we've been comparing: let's try to do standard Hamiltonian MCMC or NUTS as carefully as we can.
And then kind of think that this is the ground truth, this is the true posterior. But we don't really know whether that's the case. So if it's a hard enough case, our supposed ground truth is failing as well. We might be able to see that our solution differs from that, but then we would need to separately go and investigate which one was wrong. And that is a practical challenge, especially if you would like to have a broad set of models.
And we would want to show somehow transparently, for the end users, that in these and these kinds of problems, this and that particular method, whether it's one of ours or any other fancy new one, when do they work and when don't they? Without relying on some particular method that they already trust: if we just compare to it, we can't really convince others that our method is correct when it differs from what we're used to relying on.
Yeah, that's definitely a problem. That's also a question I asked Marylou when she was on the show, and it was kind of the same answer, if I remember correctly: for now it's hard to do benchmarks, in a way, which is definitely an issue if you're trying to work on that from a scientific perspective as well. If we were astrologists, that'd be great; then we'd be good. But if you're a scientist, then you want to evaluate your methods and...
And finding a method to evaluate the method is almost as valuable as finding the method in the first place. And where do you think we are on that in your field? Is that an active branch of the research, to try and evaluate these algorithms? What would that even look like? Or are we still really, really at a very early time for that work? That's a very good question. So I'm not aware of a lot of people who specifically focus on evaluation.
So for example, Aki has of course been working a lot on that, trying to create diagnostics and so on. But then if we think more about the flexible machine learning side, my hunch is that the individual research groups are all circling around the same problems, trying to figure out, okay, every now and then someone invents a fancy way of evaluating something.
They introduce a particular type of synthetic scenario. I think the most common thing people try is to create problems where you actually have an analytic posterior: it's somehow an artificial problem, where you take a problem and you transform it in a given way, and then you pretend you didn't have the analytic one. But they all feel a bit artificial. They feel a bit synthetic. So let's see.
It would maybe be something that the community should be talking a bit more about, in a workshop or something: OK, let's try to really think about how to verify the robustness, or possibly identify that these things are not really ready or reliable for practical use in very serious applications yet. Yeah. I haven't been following very closely what's happening, so I may be missing some important works that are already out there. Okay, yeah.
Well, Aki, if you're listening, send us a message if we forgot something. And second, that sounds like there are some interesting PhDs to do on the issue, if that's still a very new branch of the research. So, people, if you're interested in that, maybe contact Arto, and we'll see. Maybe in a few months or years, you can come here on the show and answer the question I just asked.
Another aspect of your work I really want to talk about, that I really love, and now listeners can relax because it's going to be, I think, less abstract and closer to their user experience, is about priors. You talked about it a bit at the beginning; in particular, you worked a lot on a package called PreliZ that I really love. One of my friends and fellow PyMC developers, Osvaldo Martin, is also collaborating on it. And you guys have done a tremendous job on that.
So yeah, can you give people a primer about PreliZ? What is it? When could they use it, and what's its purpose in general? Maybe I need to start by saying that I haven't worked a lot on PreliZ. Osvaldo has, and a couple of others, so I've been kind of just hovering around and giving a bit of feedback. But yeah, I'll maybe start a bit further away, so not directly from PreliZ, but from the whole question of prior elicitation. So I think the...
Yeah. The way we've been working with it, I would frame prior elicitation as some sort of, usually iterative, approach of communicating with the domain expert, where the goal is to estimate what their actual subjective prior knowledge is about whatever parameters the model has, and doing it so that it's cognitively easy for the expert. So many of the algorithms that we've been working on for this are based on this idea of predictive elicitation.
So if you have a model where the parameters don't actually have a very concrete, easily understandable meaning, you can't really start asking the expert questions about the parameters. It would require them to fully understand the model itself. The predictive elicitation techniques communicate with the expert, usually in the space of the observable quantities. So they're asking things like: is this realization somehow more likely than this other one?
And now this is where PreliZ comes into play. So when we are communicating with the user, most of the time the information we show the user is some sort of visualization of predictive distributions, or possibly also of the parameter distributions themselves. So we need an easy way of communicating, whether it's histograms of predicted values or whatnot.
So how do we show those to a user in scenarios where the model itself is some sort of probabilistic program, so we can't fixate on a given model family? That's actually the main role of PreliZ: essentially making it easy to interface with the user. Of course, PreliZ also then includes the algorithms themselves. So, algorithms for estimating the prior, and the interface components for the expert to give information.
So, make a selection, use a slider saying I would want my distribution to be a bit more skewed towards the right, and so on. That's what we are aiming at: a general-purpose tool, essentially a platform for developing and bringing into use all kinds of prior elicitation techniques.
So it's not tied to any given algorithm or anything; you just have the components and could then easily contribute, let's say, a new type of prior elicitation algorithm to the library. Yeah, and I really encourage folks to go take a look at the PreliZ package.
I put the link in the show notes because, yeah, as you were saying, that's a really easy way to specify your priors and also elicit them if you need the input of non-statisticians in your model, which you often do if the model is complex enough. So yeah, I'm using it myself quite a lot. So thanks a lot, guys, for this work. So Arto, as you were saying, Osvaldo Martin is one of the main contributors, Oriol Abril-Pla also, and Alejandro Icazatti, if I remember correctly.
So at least these four people are the main contributors. And yeah, so I definitely encourage people to go there. What would you say, Arto, is the Pareto effect, if people want to get started with PreliZ? Like the 20% of uses that will give you 80% of the benefits of PreliZ, for someone who doesn't know anything about it. That's a very good question. I think the most important thing actually is to realize that we need to be careful when we set the priors.
So simply being aware that you need a tool for this. You need a tool that makes it easy to do something like a prior predictive check. You need a tool that relieves you from figuring out how to inspect my priors, or the effects they have on the model. That's actually where the real benefit is. You get most of the benefit when you bring it in as part of your Bayesian workflow, as a concrete step where you identify: I need to do this.
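As one concrete example of such a step, here is roughly what it can look like with PreliZ; this is a sketch assuming the library's maxent helper behaves as in its documentation, so check the API of the version you install:

```python
# Ask PreliZ for the maximum-entropy Gamma that puts ~90% of its mass between 2 and 6,
# then inspect the result before dropping it into a model. API assumed, not guaranteed.
import preliz as pz

dist = pz.Gamma()
pz.maxent(dist, lower=2, upper=6, mass=0.9)  # updates dist's parameters and plots the pdf
print(dist.summary())                        # quick numerical check of the elicited prior
```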
Then the remaining tail of this thing is, of course, that maybe in some cases you have such a complicated model that you really need to deep-dive and start running algorithms that help you elicit the priors. And I would actually even say that the elicitation algorithms, I perceive them as useful even when the person is actually a statistician. I mean, there are a lot of models where we may think that we know how to set the priors.
But what we are actually doing is following some very vague ideas about what the effect is. And we may also make severe mistakes, or spend a lot of time doing it. So to an extent, these elicitation interfaces, I believe that ultimately they will be helping even hardcore statisticians to just do it faster, do it slightly better, do it perhaps in a better-documented manner. So you could, for example, store all the interaction the modeler had
with these things and put that aside: this is where we got the prior from, instead of just trial and error where we only see the result at the end. So you could revisit the choices you made during an elicitation process, like I discarded these predictive distributions for some reason, and then you can later say, okay, I made a mistake there, maybe I go and change my answer in that part, and then an algorithm provides you an updated prior
without you needing to actually go through the whole prior specification process again. Yeah. Yeah. Yeah, I really love that. And that makes the process of setting priors more reproducible, more transparent, in a way. That makes me think a bit of the scikit-learn pipelines that you use to transform the data. For instance, you just set up the pipeline and you say, I want to standardize my data, for instance. And then you have that pipeline ready.
And when you do the out-of-sample predictions, you can use the pipeline and say, okay, now do that same transformation on these new data, so that we're sure it's done the right way, but it's still transparent and people know what's going on here. It's a bit the same thing, but with the priors. And I really love that, because that makes it also easier for people to think about the priors and to actually choose the priors.
Because what I've seen in teaching is that, especially for beginners, even more when they come from the frequentist framework, setting the priors can be just paralyzing. It's like the paradox of choice: there are way too many choices. And then they end up not choosing anything, because they are too afraid to choose the wrong prior. Yes, I fully agree with that. I mean, there are a lot of very simple models that already start having six, seven, eight different univariate priors in there.
And I've been working with these things for a long time, and I still very easily make stupid mistakes, where I'm thinking that I increase the variance of this particular prior here, thinking that what I'm achieving is, for example, higher predictive variance as well. And then I realize that, no, that's not the case. Later in the model it plays some sort of role, and it actually has the opposite effect. It's hard. Yeah. Yeah. That stuff is really hard, and same here.
When I discover that, I'm extremely frustrated, because I'm like, I lost hours on this, whereas if I had a more reproducible pipeline, that would just have been handled automatically for me. So... Yeah, for sure. We're not there yet in the workflow, but that definitely makes it way easier. So yeah, I absolutely agree that we are not there yet. I mean, PreliZ is a very well-defined tool that allows us to start working on it.
But then there are the actual concrete algorithms that would make it easy to, let's say, avoid these kinds of stupid mistakes and be able to really reduce the effort. So if it now takes two weeks for a PhD student to think about and fiddle with the prior, can we get it to one day? Can we get it to one hour? Can we get it to two minutes of a quick interaction? Probably not two minutes, but maybe we can get it to one hour. It will require lots of things.
It will require even better tooling of this kind. So how do we visualize, how do we play around with it? But I think it's also going to require quite a bit better algorithms for how you estimate what the prior is from maximally limited interaction, and how you design the optimal questions you should be asking the expert.
There's no point in reiterating the same things just to fine-tune one of the variances of the priors a bit, if there is still a massive mistake somewhere in the prior and a single question would be able to rule out half of the possible scenarios. It's going to be an interesting research direction, I would say, for the next 5, 10 years. Yeah, for sure. And very valuable also, because very practical. So for sure, again, a great PhD opportunity, folks. Yeah, yeah.
Also, I mean, it may be hard to find those algorithms that you were talking about, because it is hard, right? I know, I worked on the find_constrained_prior function that we have in PyMC now. And it seemed like a very simple case. It's not even doing all the fancy stuff that PreliZ is doing. It's mainly just optimizing a distribution so that it fits the constraints that you are giving it. Like, for instance, I want a gamma with 95% of the mass between 2 and 6. Give me the parameters that fit that constraint.
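For reference, that helper call looks roughly like this; it's a sketch with an assumed signature, so double-check it against the PyMC version you are running:

```python
# Find Gamma parameters that put ~95% of the prior mass between 2 and 6.
import pymc as pm

params = pm.find_constrained_prior(
    pm.Gamma,
    lower=2,
    upper=6,
    mass=0.95,
    init_guess={"alpha": 4.0, "beta": 1.0},  # starting point for the optimizer
)
print(params)  # e.g. {"alpha": ..., "beta": ...}, ready to plug into a model
```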
Give me the... parameters that fit that constraint. That's actually surprisingly hard mathematically. You have a lot of choices to make, you have a lot of things to really be careful about. And so I'm guessing that's also one of the hurdles right now in that research. Yeah, it absolutely is. I mean, I would say at least I'm approaching this.
more or less from an optimization perspective: I mean, yes, we are trying to find a prior that best satisfies whatever constraints we have, and trying to formulate an optimization problem of some kind that gets us there. This is also where I think there's a lot of room for the, let's say, flexible machine learning tools type of things.
So, I mean, if you think about the prior that satisfies these constraints, we could be specifying it not as a particular parametric prior but with some sort of flexible representation, and then just optimizing within a much broader set of priors. But then, of course, it requires completely different kinds of tools than we are used to working with. It also requires people accepting that our priors may take arbitrary shapes.
They may be distributions that we could have never specified directly. Maybe they're multimodal priors that we just infer, something you could never have written down by hand. And there's also going to be a lot of work from the educational perspective in getting people to accept this.
But even if I give you a perfect algorithm that somehow cranks out a prior, and then you look at the prior and you're saying, I don't even know what distribution this is, I would have never ever converged to this if I was doing it manually. So will you accept that that's your prior, or will you insist that my method is doing something stupid, I mean, I still want to use my Gaussian prior here? Yeah, that's a good point.
And in a way that's related to a classic problem that you have when you're trying to automate a process. I think there's the same issue with automated cars, like those self-driving cars, where people actually trust the cars more if they think they have some control over them. I've seen interesting experiments where they put a placebo button in the car that people could push to override if they wanted to, but the button wasn't doing anything.
People said they trusted these cars more than the completely self-driving ones. That's also definitely something to take into account, but that's more related to human psychology than to the algorithms per se. It is related to human psychology, but it's also related to this evaluation perspective.
I mean, of course, if we did have a very robust evaluation procedure that somehow tells you that once you start using these techniques your final conclusions will, in some sense, be better, and if we can make that very convincing, then it will be easier. I mean, if you think about it, there were a lot of people who would say that a very massive neural network with four billion parameters would never ever be able to answer a question given in natural language.
A lot of people were saying that five years ago: this is a pipe dream, it's never gonna happen. Now we do have it, and now everyone is ready to accept that yes, it can be done. And they are willing to actually trust these ChatGPT type of models with a lot of things. And they are investing a lot of effort into figuring out what to do with them. It just needs this kind of very concrete demonstration that there is value and that it works well enough.
It will still take time for people to really accept it, but I think that's the key ingredient. Yeah, yeah. I mean, it's also good in some way: that skepticism makes the tools better. So that's good. I mean, we could keep talking about PreliZ, because I have other technical questions about that. But actually, that's a perfect segue to a question I also had for you, because you have a lot of experience in that field.
So how do you think industries can better integrate Bayesian approaches into their data science workflows? Because that's basically what we ended up talking about right now, without me nudging you towards it. Yeah, I have actually indeed been thinking about that quite a bit. So I do a lot of collaboration with industrial partners in different domains. I think there are a couple of perspectives to this.
So one is that people are finally, I think, starting to accept the fact that probabilistic programming with kind of black-box automated inference is the only sensible way of doing statistical modeling. Looking back like 10-15 years ago, you would still have a lot of people, maybe not in industry but in research in different disciplines, in meteorology or physics or whatever.
People would actually be writing Metropolis-Hastings algorithms from scratch, which is simply not reliable in any sense. I mean, it took time for them to accept that yes, we can actually now do it with something like Stan. I think this is of course the way, to the extent that there are problems that fit well with what something like Stan or PyMC offers. I think we've been educating master's students long enough who are familiar with these concepts.
Once they go to the industry they will use them, they know roughly how to use them. So that's one side. But then the other thing is that I think... Especially in many of these predictive industries, so whether it's marketing or recommendation or sales or whatever, people are anyway already doing a lot of deep learning types of models there. That's a routine tool in what they do. And now if we think about that, at least in my opinion, that these fields are getting closer to each other.
So we have more and more deep learning techniques where, the variational autoencoder is a prime example, the model is ultimately a Bayesian model in itself. It may actually be that all this Bayesian thinking and reasoning creeps into use through the next generation of these deep learning techniques that they are using.
They've been building those models, they've been figuring out that they cannot get reliable estimates of uncertainty, they maybe tried some ensembles or whatnot. And they will be following. So once the tools are out there, there's good enough tutorials on how to use those. So they might start using things like, let's say, Bayesian neural networks or whatever the latest tool is at that point. And I think this may be the easiest way for the industries to do so.
They're not going to switch back to very simple classical linear models when they do their analysis. But they're going to make their deep learning solutions Bayesian on some time scale. Maybe not tomorrow, but maybe in five years. Yeah, that's a very good point. Yeah, I love that. And of course, I'm very happy about that, being one of the actors making the industry more Bayesian. So I have a vested interest in this. But yeah, also, I've seen the same evolution you were talking about.
Right now, it's not even really an issue of convincing people to use these kind of tools. I mean, still from time to time, but less and less. And now the question is really more in making those tools more accessible, more versatile, easier to use, more reliable, easier to deploy in industry, things like that, which is a really good point to be at for sure. And to some extent, I think it's... It's an interesting question also from the perspective of the tools.
So to some extent, it may mean that we just end up doing a lot of the Bayesian analysis on top of what we would now call deep learning frameworks. And it's going to be, of course, libraries building on top of those. So, like Pyro is a library building on top of PyTorch. But the syntax is intentionally similar to what people are used to in the deep learning type of modeling. And this is perfectly fine.
We are anyway using a lot of stochastic optimization routines in Bayesian inference and so on. So they are actually very good tools for building all kinds of Bayesian models. And I think this may be the layer where the industry use happens. They need the GPU type of scaling and everything there anyway, so we're just happy to have our systems work on top of these libraries. Yeah, very good point.
And also, to come back to one of the points you made in passing, education is helping a lot with that. You have been educating the data scientists who now go into industry. And I know in Finland, not so much in France, where I'm originally from, but in Finland, I know there is this really great integration between the research part, the university, and the industry. You can really see that in the PhD positions, in the professorship positions, and things like that.
So I think that's really interesting and that's why I wanted to talk to you about that. To go back to the education part, what challenges and opportunities do you see in teaching Bayesian machine learning as you do at the university level? Yeah, it's challenging. I must say that. I mean, especially if we get to the point of well, Bayesian machine learning. So it is a combination of two topics that are somewhat difficult in itself.
So if we want to talk about normalizing flows, and then we want to talk about statistical properties of estimators or MCMC convergence, they require different kinds of mathematical tools, they require a certain level of expertise on the software, on the programming side. So what it means, actually, is that if we look at the population of, let's say, data science students, we will always have a lot of people that are missing background on one of these sides.
So I think this is a difficult topic to teach. If it was a small class, it would be fine. But it appears to be that at least our students are really excited about these things. So I can launch a course with explicitly a title of a Bayesian machine learning, which is like an advanced level machine learning course. And I would still get 60 to 100 students enrolling on that course.
And then that means that within that group, there's going to be some CS students with almost no background on statistics. There's going to be some statisticians who certainly know how to program but they're not really used to thinking about GPU acceleration of a very large model. But it's interesting, I mean it's not an impossible thing. I think it is also a topic that you can kind of teach on a sufficient level for everyone.
So everyone is able to understand the basic reasoning of why we are doing these things. Some of the students may struggle figuring out all the math behind it, but they might still be able to use these tools very nicely. They might be able to say that if I do this and that kind of modification, I realize that my estimates are better calibrated. And some others are really going deeper into figuring out why these things work.
So it just needs a bit of creativity on how we do it and what we expect from the students. What should they know once they've completed a course like this? Yeah, that makes sense. Have you also seen an increase in the number of students in recent years? Well, we get as many students as we can take.
So I mean, it's actually been for quite a while already that in our university, by far the most... popular master's programs and bachelor's programs are essentially data science and computer science. So we can't take in everyone we would want. So it actually looks to us that it's more or less like a stable number of students, but it's always been a large number since we launched, for example, the data science program. So it went up very fast. So there's definitely interest.
Yeah. Yeah. That's fantastic. And... So I've been taking a lot of your time, so we're going to start to close up the show, but there are at least two questions I want to get your insight on. And the first one is: what do you think the biggest hurdle in the Bayesian workflow currently is? We've talked about that a bit already, but I'd love to get your structured answer. Well, I think the first thing is getting people to actually start using more or less systematic workflows.
I mean, the idea is great. We kind of know more or less how we should be thinking about it, but it's a very complex object. So we're going to be able to tell experts, statisticians, that yes, this is roughly how you should do it. Then we should still also convince them, almost force them, to stick to it. But then especially if we think about newcomers, people who are just starting with these things, it's a very complicated thing.
So if you need to read a 50-page or 100-page book about the Bayesian workflow to even know how to do it, that's a real challenge. So I think in the long term, we are going to get tools for assisting it. So really streamlining the process, thinking of something like an AI assistant for a person building a model, that really prompts you: now I see that you are trying to go there and do this, but I see that you haven't done prior predictive checks.
I actually already created some plots for you. Please take a look at these and confirm: is this what you were expecting? It's going to be a lot of effort to create those. It's something that we've been trying to think about, how to do it, but it's still open. I think that's where the challenge is. We know most of the stuff within the workflow, roughly how it should be done. At least we have good enough solutions.
But then really helping people to actually follow these principles, that's gonna be hard. Yeah, yeah, yeah. But damn, that would be super cool. We're talking about something like a Jarvis, you know, like the AI assistant, a Jarvis, but for Bayesian models. How cool would that be? Love that. And looking forward, how do you see Bayesian methods evolving with artificial intelligence research? Yeah, I think...
For quite a while I was about to say that, like, I've been building on this basic idea that the deep learning models as such will become more and more Bayesian anyway. So that's kind of a given. But now, of course, the recent very large-scale AI models are getting so big that the question of computational resources becomes a major hurdle for doing learning for those models, even in the crudest possible way.
So there are, of course, clear needs for uncertainty quantification in the large language model type of scope. They are really quite unreliable. They're really poor at, for example, evaluating their own confidence. So there have been some examples where, if you ask how sure it is about a statement, it gives, more or less irrespective of the statement, a similar number. Yeah, 50% sure. I don't know.
So it may be that, at least in the very short run, it's not going to be the Bayesian techniques that really solve all the uncertainty quantification in those types of models. In the long term, maybe it is. But it's going to be interesting. It looks to me a bit like a lot of the stuff built to address specific limitations of these large language models consists of separate components.
It's some sort of an external tool that reads in those inputs, or it's an external tool that the LLM can use. So maybe this is going to be this kind of separate element that somehow integrates. So an LLM, of course, could have an API interface where it can query, let's say, Stan to figure out an answer to a type of question that requires probabilistic reasoning.
So people have been plugging things in; there are these famous public examples where you can query mathematical reasoning engines and so on, so that the LLM, if you ask a specific type of question, goes outside of its own realm and does something. It already kind of knows how to program, so maybe we just need to teach LLMs to do statistical inference by relying on actually running an MCMC algorithm on a model that they specify together with the user.
I don't know whether anyone is actually working on that. It's something that just came to my mind. So I haven't really thought about this too much. Yeah, but again, we're getting so many PhD ideas for people right now. We are. Yeah, I feel like we should be doing a best-of of all your awesome PhD ideas. Awesome. Well, I still have so many questions for you, but let's start closing up the show, because I don't want to take too much of your time. I know it's getting late in Finland.
So let's close up the show and ask you the last two questions. I always ask at the end of the show. First one, if you had unlimited time and resources, which problem would you try to solve? Let's see. The lazy answer is that I am now trying to get unlimited resources, well, not unlimited resources, but I'm really trying to tackle this prior elicitation question.
I think for most of the other parts of the Bayesian workflow we have reasonably good solutions, but this whole question of how to figure out complex multivariate priors over arbitrarily complex models, that's a very practical thing that I am investing in. But maybe, if it really is infinite, then maybe I could actually continue on the quick idea that we just talked about.
That I mean really getting this probabilistic reasoning at the core of these large language model type of AI applications. That it would really be reliably answering proper probabilistic judgments of the kind of decision -making reasoning problems that we ask from them. So that would be interesting. Yeah. Yeah, for sure. And second question, if you could have dinner with any great scientific mind, dead or alive or fictional, who would it be?
Yes, this is something I actually thought about, because I figured you would be asking it of me as well. And I decided that I like fictional characters. So I went with Daniel Waterhouse from Neal Stephenson's Baroque Cycle books. They are kind of semi-historical books. They talk about the era when Isaac Newton and others were living and establishing the Royal Society. And there are a lot of fantasy components involved.
And Daniel Waterhouse in those novels is the roommate of Isaac Newton and a friend of Gottfried Leibniz. So he knows both sides of this great debate on who invented calculus and who copied whom. So if I had a dinner with him, I would get to talk about these innovations that I think are among the foundational ones. But I wouldn't actually need to get involved with either party. I wouldn't need to choose sides, whether it's Isaac or Gottfried that I would be talking to. Love it.
Yeah, love that answer. Make sure to record that dinner and post it on YouTube. I'm pretty sure lots of people will be interested in it. Fantastic. Thanks. Thanks a lot, Arto. That was a great discussion. Really happy we could go through the, well, not the whole depth of what you do because you do so many things, but a good chunk of it. So I'm really happy about that. As usual, I'll put resources and a link to your website in the show notes for those who want to dig deeper.
Thank you again, Arto, for taking the time and being on this show. Thank you very much. It was my pleasure. I really enjoyed the discussion. This has been another episode of Learning Bayesian Statistics. Be sure to rate, review, and follow the show on your favorite podcatcher, and visit learnbayesstats.com for more resources about today's topics, as well as access to more episodes to help you reach a true Bayesian state of mind. That's learnbayesstats.com.
Our theme music is Good Bayesian by Baba Brinkman, featuring MC Lars and Mega Ran. Check out his awesome work at bababrinkman.com. I'm your host, Alex Andorra. You can follow me on Twitter at alex_andorra, like the country. You can support the show and unlock exclusive benefits by visiting patreon.com slash LearnBayesStats. Thank you so much for listening and for your support.
You're truly a good Bayesian, change your predictions after taking information in, and if you're thinking I'll be less than amazing, let me show you how to be a good Bayesian. Change calculations after taking fresh data in. Those predictions that your brain is making, let's get them on a solid foundation.