
#110 Unpacking Bayesian Methods in AI with Sam Duffield

Jul 10, 2024 · 1 hr 12 min · Season 1 · Ep. 110

Episode description

Proudly sponsored by PyMC Labs, the Bayesian Consultancy. Book a call, or get in touch!


Our theme music is « Good Bayesian », by Baba Brinkman (feat MC Lars and Mega Ran). Check out his awesome work!

Visit our Patreon page to unlock exclusive Bayesian swag ;)

Takeaways:

  • Use mini-batch methods to efficiently process large datasets within Bayesian frameworks in enterprise AI applications.
  • Apply approximate inference techniques, like stochastic gradient MCMC and Laplace approximation, to optimize Bayesian analysis in practical settings.
  • Explore thermodynamic computing to significantly speed up Bayesian computations, enhancing model efficiency and scalability.
  • Leverage the Posteriors Python package for flexible and integrated Bayesian analysis in modern machine learning workflows.
  • Overcome challenges in Bayesian inference by simplifying complex concepts for non-expert audiences, ensuring the practical application of statistical models.
  • Address the intricacies of model assumptions and communicate effectively to non-technical stakeholders to enhance decision-making processes.

Chapters:

00:00 Introduction to Large-Scale Machine Learning

11:26 Scalable and Flexible Bayesian Inference with Posteriors

25:56 The Role of Temperature in Bayesian Models

32:30 Stochastic Gradient MCMC for Large Datasets

36:12 Introducing Posteriors: Bayesian Inference in Machine Learning

41:22 Uncertainty Quantification and Improved Predictions

52:05 Supporting New Algorithms and Arbitrary Likelihoods

59:16 Thermodynamic Computing

01:06:22 Decoupling Model Specification, Data Generation, and Inference

Thank you to my Patrons for making this episode possible!

Yusuke Saito, Avi Bryant, Ero Carrera, Giuliano Cruz, Tim Gasser, James Wade, Tradd Salvo, William Benton, James Ahloy, Robin Taylor, Chad Scherrer, Zwelithini Tunyiswa, Bertrand Wilden, James Thompson, Stephen Oates, Gian Luca Di Tanna, Jack Wells, Matthew Maldonado, Ian Costley, Ally Salim, Larry Gill, Ian Moran, Paul Oreto, Colin Caprani, Colin Carroll, Nathaniel Burbank, Michael Osthege, Rémi Louf, Clive Edelsten, Henri Wallen, Hugo Botha, Vinh Nguyen, Marcin Elantkowski, Adam C. Smith, Will Kurt, Andrew Moskowitz, Hector Munoz, Marco Gorelli, Simon Kessell, Bradley Rode, Patrick Kelley, Rick Anderson, Casper de Bruin, Philippe Labonde, Michael Hankin, Cameron Smith, Tomáš Frýda, Ryan Wesslen, Andreas Netti, Riley King, Yoshiyuki Hamajima, Sven De Maeyer, Michael DeCrescenzo, Fergal

Transcript

Folks, strap in, because today's episode is a deep dive into the fascinating world of large-scale machine learning. And who better to guide us through this journey than Sam Duffield. Currently honing his expertise at Normal Computing, Sam has an impressive background that bridges the theoretical and practical realms of Bayesian statistics, from quantum computation to the cutting edge of AI technology.

In our discussion, Sam breaks down complex topics such as the Posteriors Python package, mini-batch methods, approximate inference, and the intriguing world of thermodynamic hardware for statistics. Yeah, I didn't know what that was either. We delve into how these advanced methods like stochastic gradient MCMC and Laplace approximation are not just theoretical concepts but pivotal in shaping enterprise AI models today.

And Sam is not just about algorithms and models: he is a sports enthusiast who loves football, tennis and squash, and he recently returned from an awe-inspiring trip to the Faroe Islands. So join us as we explore the future of AI with Bayesian methods. This is Learning Bayesian Statistics, episode 110, recorded May 31, 2024. Welcome to Learning Bayesian Statistics, a podcast about Bayesian inference, the methods, the projects, and the people who make it possible. I'm your host, Alex Andorra.

You can follow me on Twitter at alex_andorra, like the country. For any info about the show, learnbayesstats.com is la place to be. Show notes, becoming a corporate sponsor, unlocking Bayesian merch, supporting the show on Patreon, everything is in there. That's learnbayesstats.com. If you're interested in one-on-one mentorship, online courses, or statistical consulting, feel free to reach out and book a call at topmate.io/alex_andorra.

See you around, folks, and best Bayesian wishes to you all. Sam Duffield, welcome to Learning Bayesian Statistics. Thanks, thank you very much. Yeah, thank you so much for taking the time. I invited you on the show because I saw what you guys at Normal Computing were doing, especially with the Posteriors Python package. And I am personally always learning new stuff.

Right now I'm learning a lot about sports analytics, because that's always been a personal passion of mine, and Bayesian stats is extremely useful in that field. But I'm also, in conjunction, working a lot on LLMs and their interaction with the Bayesian framework. I've been working much more on the BayesFlow package, which we talked about with Marvin Schmitt in episode 107.

So, yeah, working on developing a PyMC bridge to BayesFlow, so that you can write your model in PyMC and then use amortized Bayesian inference for your PyMC models. It's still way, way down the road. I need to learn about all that stuff, but that's really fascinating. I love that. And so of course, when I saw what you were doing with Posteriors, I was like, that sounds awesome. I want to learn more about that. So I'm going to ask you a lot of questions, a lot of things I don't know.

So that's great. But first, can you give us a brief overview of your research interests and how Bayesian methods play a role in your work? Yeah, thanks again for the invite. I think, yeah, sports analytics, Bayesian statistics, language models: we have a lot to talk about. Should be fun. Bayesian methods in my work, yes. So at Normal we have a lot of problems where we think that Bayes is the right answer, if you could compute it exactly.

So what we're trying to do is look at different approximations, how they scale in different methods and different settings, and how we can get as close to exact Bayes, the exact integral and updating under uncertainty, that can provide us with some of those benefits. Yeah. OK. Yeah. That's interesting. I, of course, agree. Of course. Actually, do you remember when you were first introduced to Bayesian inference?

Because you have an extensive background, you've studied a lot. When, in those studies, were you introduced to the Bayesian framework? And also, how did you end up working on what you're working on nowadays? Yeah, okay. I'll try not to rant too long about this. But yeah, so I guess, mathematics undergraduate at Imperial. I was very young at that stage, we're all very young in our undergraduates, so not really sure what we want to do.

At some point, it came to me that statistics, within the field of mathematics, is where I could work on applied problems and where the field is going. And that's what got me excited.

Statistics at undergraduate is different in different places, but you get thrown a lot of different points of view, probably in all courses: you get your frequentist hypothesis testing, and then you have your Bayesian methods as well.

And the Bayesian approach really settled with me as being more natural, in the sense that you just write things down: you have your forward model and your prior, and then Bayes' theorem handles everything else. One of the lecturers in my first year said, yeah, mathematicians are lazy, they want to do as little as possible.

So Bayes' theorem is kind of nice there, because you just write down your likelihood, you write down your prior, and then Bayes' theorem handles the rest. So you have to do the minimum possible work: you have your data, likelihood, prior, and then done. That was really compelling to me. And that led me to my PhD, which was in the engineering department in Cambridge.

So yeah, I had a few thoughts on what to do for my PhD. There was some more theoretical stuff, but I wanted to get into some problems, get into the weeds a bit. So yeah, the engineering department at Cambridge, working on Bayesian statistics, state space models and, within state space models, sequential Monte Carlo. And terminology-wise, I use state space model and hidden Markov model to mean the same thing.

So yeah, you have this time series style data, and working on that sort of data, I feel like the propagation of uncertainty really shines there, because you need to take into account your uncertainty from the previous experiments, say, when you update for your new ones. That was really compelling for me. That was, I guess, my route into Bayesian statistics. Yeah, okay. Actually, here I could ask you a lot of questions, but... those time series models.

I'm always fascinated by time series models. I don't know, I love them for some reason. I find there is a kind of magic in the ability of a model to take time dependencies into account. I love using Gaussian processes for that. So I could definitely go down that rabbit hole, but I'm afraid then I won't have enough time for you to talk about Posteriors. Let me just say one minute about it. Yeah, Gaussian processes are really cool.

A Gaussian process you can think of as a continuous-time (or continuous-space, whatever the varying axis is; we'll call it time) version of a state space model.

And a state space model or hidden Markov model, to me, is the canonical extension of a static Bayesian inference model to the time-varying setting. And they kind of unify each other, because you can write smoothing in a state space model as one big static Bayesian inference problem, and you can write a static Bayesian inference problem, just p of x given y, recovering x from y, as a single step of a state space model.

So the techniques that you build just overlap, at least conceptually, on the mathematical level. When you actually get into the approximations and the computation there are different things to consider, different axes of scalability to consider, but conceptually, I really like that. I probably ranted for a bit more than a minute there, so I apologize. No, no, that's fine. I love that.

Yeah. I have much more knowledge and experience on GPs, but I'm definitely super curious to also apply these state space models and so on. So I'm definitely going to read the paper you sent me about skill rating of football players, where you're using, if I understand correctly, some state space models. That's going to be two birds with one stone. So thanks a lot for writing that.

The whole point of that paper is to say that rating systems, Elo, TrueSkill, can and should be reframed as state space models. And then you just have your full Bayesian understanding of them. Yeah, yeah. I mean, for sure. I'm working myself on a project like that on football data. And yeah, the first thing I was doing is like, okay, I'm going to write the simple model. But then as soon as I have that down, I'm going to add a GP to that. It's like, I have to take these nonlinearities into account.

So yeah, I'm super excited about that. So thanks a lot for giving me some weekend reading. So actually now let's go into your Posteriors package, because I have so many questions about that. Could you give us an overview of the package, what motivated its development, and also put it in the context of large-scale AI models? Yeah, so as I said, we at Normal think that Bayes is the right answer.

But yeah, we're interested in large-scale enterprise AI models. So we need to be able to scale these to big, big models, big parameter sizes and big data at the same time. This is what the Posteriors Python package, built on PyTorch, really hopes to bring. It's built with flexibility and research in mind. So really we want to try out different methods, for different data sets and different goals, and see what's going to be the best approach for us.

That's the motivation of the Posteriors package. When would people use it? For instance, for which use cases would I use Posteriors? There's a lot of just genuinely fantastic Bayesian software out there. But most of it has focused on the full-batch setting, as is classically the case with the Metropolis-Hastings accept/reject step. And we feel like we're moving, or have already moved, into the mini-batch era, the big data era. So Posteriors is mini-batch first.

So if you have a lot of data, even if you have a small model, and you want to try posterior sampling with mini-batches, to see if that can speed up your inference rather than doing full batch on every step, then Posteriors is the place for that, even with small models. So you can just write down your model in Pyro, in PyTorch, and then use Posteriors to do that.

But then that's moving from classical Bayesian statistics into the mini-batch setting. And there are also benefits to even very crude approximations to the Bayesian posterior in these really large-scale models.

So like, yeah, language models, big neural networks: you're not going to be able to do your convergence checks and those sorts of things in those models, but you might still be able to get some advantages: out-of-distribution detection, improved out-of-distribution performance, sort of continual learning. These are the sorts of things we're investigating. If you just trained with gradient descent, essentially, you wouldn't

necessarily get these things. But even very crude Bayesian approximations will hopefully provide these benefits. I think I will talk about this more later. Yeah, okay. So basically, what I understand is that you can use Posteriors for basically any model. So I mean, we're still young.

Yeah, the package is very young, and it doesn't have the support of, I don't know, if you want to do Gaussian processes, we're not going to have a whole suite of kernels that you can just type up. But fundamentally, it just takes a function, a log posterior function, and then you will be able to try out different methods. But as I said, the big data regime is much less researched, and the big parameter regime is much harder,

at least. So it's not going to be a silver bullet. Posteriors is a tool for research a lot of the time, where you're going to research what inference methods you can use, where they fail, and hopefully where they succeed as well. Okay. Okay. I see. And so to make sure listeners understand, can you do both in Posteriors? Can you write your model in Posteriors and then sample from it?

Or is it only model definition, or only model sampling? So it only does approximate posterior sampling. You're given some data and you write down the log posterior. Or the joint, you could say. It doesn't have the sophisticated support of Stan or PyMC, where you can actually write down the model and there's support for all the distributions and doing forward samples.

It leans on other tools like Pyro or PyTorch itself for that. Posteriors is about approximate inference in the posterior space, in the sample space. So you can do Laplace approximation and these things and compare them. And importantly, it's mini-batch first. So every method only expects to receive data batch by batch, so you can support the large data regime. Okay, so I think there are a bunch of terms we need to define here for listeners. Okay, yeah, sorry about that.

Can you define mini-batch? Can you define approximate inference and, in particular, Laplace approximation? Okay, so mini-batch is the important one, of course. Normally, in traditional Bayesian statistics, if you're running random walk Metropolis-Hastings or HMC, you will be seeing your whole dataset, all N data points, at every step of the iteration. And there's beautiful theory about that. But a lot of the time in machine learning, you have a billion data points.

Or if you're doing a foundation model, it's like all of Wikipedia, billions of data points or something like that. And there's just no way that every time you do a gradient step, you sum over a billion data points. So you take 10 of them, and you get an unbiased approximation. But this doesn't propagate through the exponential, which you need for the Metropolis-Hastings step. So it rules out a lot of traditional Bayesian methods, but there's still been research on this.
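To spell out the mini-batch idea: a standard unbiased estimator of the full-data log-likelihood, from a random batch $\mathcal{B}$ of size $n$ out of $N$ points, is

$$\widehat{\ell}(\theta) = \frac{N}{n} \sum_{i \in \mathcal{B}} \log p(y_i \mid \theta), \qquad \mathbb{E}\big[\widehat{\ell}(\theta)\big] = \sum_{i=1}^{N} \log p(y_i \mid \theta).$$

But unbiasedness in log space does not survive exponentiation, since $\mathbb{E}\big[e^{\widehat{\ell}}\big] \neq e^{\mathbb{E}[\widehat{\ell}]}$ in general, which is why the Metropolis-Hastings acceptance ratio breaks down under naive subsampling.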

So this is what we call scalable Bayesian learning, which is what we're targeting with Posteriors. We're investigating mini-batch methods: methods that only use a small amount of the data at each step, as is very common in optimization terms, gradient descent, stochastic gradient descent. So hopefully that's mini-batches. Okay, you said approximate inference. So, okay, yeah, inference is a very loaded term.

Maybe I should try not to use it, but when I say approximate inference, I mean approximate Bayesian inference. So you can write down mathematically the posterior distribution, p of theta given y, proportional to p of theta times p of y given theta. But you only have access to pointwise evaluations of that, and potentially even only mini-batch pointwise evaluations.

So approximate inference is forming some approximation to that posterior distribution, whether that's a Gaussian approximation or through Monte Carlo samples, so just an ensemble of points. That's approximate inference. And yeah, you have different fidelities of this posterior approximation. Last one, Laplace approximation.

The Laplace approximation is arguably the simplest approximation to the posterior distribution in the machine learning setting. It's just a Gaussian distribution, so all you need to define is a mean and a covariance. You define the mean by doing an optimization procedure on your log posterior, or just log likelihood. And that will give you a point that will be your mean.

And then, okay, it gets quite into the weeds, the Laplace approximation, but ideally you then do a Taylor expansion around that point. A second-order Taylor expansion will give you the Hessian, and we would recommend the inverse Hessian as your approximate covariance. But there are subtleties there, and you can use the Fisher information instead, as is often done. And yeah, I'm sure you've had people on the podcast explain it better than me.
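In symbols, the recipe being sketched here: find the mode $\theta^{*} = \arg\max_{\theta} \log p(\theta \mid y)$ by optimization, then set

$$p(\theta \mid y) \approx \mathcal{N}\big(\theta ;\, \theta^{*},\, H^{-1}\big), \qquad H = -\nabla^{2}_{\theta} \log p(\theta \mid y)\,\big|_{\theta = \theta^{*}},$$

with the Fisher information commonly substituted for the Hessian $H$ when the Hessian is intractable or not positive definite.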

Yeah. For Laplace, no. Actually, that's why I asked you to define it. I'm happy to go down into the weeds if you want. Yeah, if you think that's useful. Otherwise, we can definitely also do an episode with someone you'd recommend to talk about Laplace approximation. Something I'd like to communicate to listeners: we say approximation, but at the same time, MCMC is an approximation itself. So that can be a bit confusing.

Can you talk about why these kinds of methods, like Laplace approximation, and I think VI, variational inference, would fall into this bucket too, are called approximations, in contrast to MCMC? What's the main difference here?

Honestly, I would say MCMC is also an approximation in the same terminology, but the difference is bias: some methods are asymptotically unbiased, which MCMC is, and stochastic gradient MCMC, which is what Posteriors uses, is as well under some caveats, and there are caveats for normal MCMC too. But yeah, then you have your Gaussian approximations from variational inference and the Laplace approximation.

And these are very much approximations in the sense that there's no axis you can increase to infinity to recover the posterior. You cannot do that with the Gaussian approximations, unless your posterior is known to be Gaussian, and I mean, there are interesting cases like that, like Gaussian processes and things.

But yeah, they don't have this asymptotically unbiased feature that MCMC does, or importance sampling and sequential Monte Carlo do, which is very useful because it allows you to trade compute for accuracy. You can't do that with a Laplace approximation or VI, beyond extending, like going from a diagonal covariance to a full covariance or things like that. And this is very useful in the case that you have extra compute available.

So I'm a big fan of the asymptotic unbiasedness property, because it means that you can increase your compute in safety. Yeah. Yeah. Great explanation. Thanks a lot. And so, as you were saying, there isn't this asymptotic unbiasedness with these approximations, but at the same time, that means they can be way faster. So if you're in the right use case, then it really makes sense to use them.

But you have to be careful about the conditions where the approximation falls down. Can you maybe dive a bit deeper into stochastic gradient descent, which is the method that Posteriors is using, and how that fits into these different methods that you just talked about? Actually, stochastic gradient descent is not a method that Posteriors is using per se.

Stochastic gradient descent is the workhorse of most machine learning algorithms, but Posteriors is kind of saying it shouldn't be, perhaps, or not in all cases. Stochastic gradient descent is what you use if you have extremely large data and you just want to find the MLE, so the maximum likelihood estimate, or the minimum of a loss, you might say. So it's just an optimization routine.

You just want to find the parameters that minimize something. If you're doing variational inference, what you can do is tractably get the KL divergence between your specified variational distribution and the posterior. And then you have parameters, the parameters of the variational distribution over your model parameters, and you use stochastic gradient descent on that.
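Written out, that objective is to choose a variational family $q_{\phi}(\theta)$ and minimize

$$\mathrm{KL}\big(q_{\phi}(\theta) \,\|\, p(\theta \mid y)\big) = \mathbb{E}_{q_{\phi}}\big[\log q_{\phi}(\theta) - \log p(\theta, y)\big] + \log p(y),$$

where $\log p(y)$ is constant in $\phi$, so stochastic gradient descent on the first term (the negative ELBO), with mini-batch estimates of $\log p(\theta, y)$, is all that's needed.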

So this is nice, because it just means that you can throw the workhorse from machine learning at a Bayesian problem and get a Bayesian approximation out. Again, as we mentioned, it doesn't have this asymptotically unbiased feature, which is maybe less of a concern in machine learning models, where you have less ability to trade compute because you've kind of filled your compute budget with your gigantic model.

Although we think this might change over the coming years. But yeah, maybe not. Maybe we'll just go even bigger and bigger and bigger. Okay, sorry, I got lost. You were asking about stochastic gradient descent. So actually, there's something interesting to say here, which also gets at the main distinguishing characteristics of Posteriors, so that people really understand its use case. Yeah.

Okay. So yeah, a key thing about the way we've written Posteriors is that we like, where possible, to have stochastic gradient descent, so optimization, as a limit under some hyperparameter specification of the algorithms. And it turns out that works in a lot of cases. So we talked about MCMC, and then we talked about stochastic gradient MCMC, which are MCMC methods that natively handle mini-batches.

And a lot of the time, you can write down a temperature parameter for your posterior distribution. If the temperature is very high, your posterior distribution is very heated up: you've increased the tails and it's much closer to a uniform distribution. If you take it very cold, it becomes very pointed and focused around the optima.

So we write the algorithms so that there's this convenient transition through the temperature: you set the temperature to zero, you just get optimization. And this is a key thing about Posteriors. The Posteriors stochastic gradient MCMC methods have this temperature parameter, which, if you set it to zero, gives you a variant of stochastic gradient descent.

So you can unify gradient descent and stochastic gradient MCMC, and it's nice. You have your Langevin dynamics, which tempered down to zero just becomes vanilla gradient descent. You have underdamped Langevin dynamics, or stochastic gradient HMC, stochastic gradient Hamiltonian Monte Carlo: you set the temperature to zero and you've just got stochastic gradient descent with momentum.
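Schematically, the tempered target is

$$p_{T}(\theta \mid y) \;\propto\; \big(p(\theta)\, p(y \mid \theta)\big)^{1/T},$$

so $T = 1$ recovers the Bayesian posterior, while $T \to 0$ concentrates all mass on the optima, which is where the sampling algorithms collapse into their optimization counterparts.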

So yeah, this is a nice thing about Posteriors: it sort of unifies these approaches, and it hopefully makes it less scary to use Bayesian approaches, because you know you always have gradient descent, and you can sanity check just by fiddling with the temperature parameter. Okay, that's really cool.

Okay. So it's a bit like the temperature parameter in the transformers, I mean in the LLMs, that adds a bit of variation on top of the predictions that the LLM could make. Yeah, it's exactly the same as that. So when you use this in language models or natural language generation, you temper the generative distribution, so the logits get tempered. If you set the temperature there to zero, you get greedy sampling. But we're doing this in parameter space.
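For comparison, the generation-time version is the tempered softmax over logits $z$,

$$p_{i} = \frac{\exp(z_{i} / T)}{\sum_{j} \exp(z_{j} / T)},$$

which at $T \to 0$ puts all mass on the argmax, i.e. greedy decoding; Posteriors applies the same idea to the distribution over parameters rather than over tokens.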

But we're doing this in parameter space. So it's, yeah. It has this, yeah, exactly. Distribution tempering is a broad thing, particularly in, I'm not going to go too philosophical, but I mean, I've first met with like tempering, then we thought about it in the settings of sequential Monte Carlo, and it's like, is it the natural way? Is it something that's natural to do?

But in the context of Bayes, because Bayes' theorem is multiplicative, right, you have your p of theta times p of y given theta, it kind of makes sense to temper, because it means, okay, I'll just introduce the likelihood a little bit at a time. Tempering is a natural way to do that because of the multiplicative feature of Bayes' theorem. So it kind of settled with me after thinking about it like that. Yeah, no, I mean, that makes perfect sense.

And I was really surprised to see that it was used in LLMs when I first read about the algorithms. And I was pleasantly surprised because I've worked a lot on electoral forecasting models. That's how I was introduced to Bayesian stats. Actually, I'd used tempering without knowing it, because I'm using the softmax all the time in electoral forecasting: unless you're doing that in the US, you need a multinomial likelihood. The multinomial needs a probability distribution.

And how do you get that? From the softmax function, which is actually a very important one in the LLM framework. And also, the thing is, your probability is the latent popularity of each party, but you never observe it, right? And so the polls, you could conceptualize them as a tempered version of the true latent popularity. And so that was really interesting.

I was like, damn, this stuff is much more powerful than what I thought, because I was applying it only to electoral forecasting models, which is a very niche application of these models, when actually there are so many applications of that in the wild. Yeah, tempering in general is very widespread, and also, I would say, not particularly well understood.

Like yeah, there's been research on this cold posterior effect, which is a perhaps annoying thing for Bayesian modeling on neural networks. As I said, you have this temperature parameter that transitions between optimization and the Bayesian posterior: zero is optimization, one is the Bayesian posterior.

And empirically, we see better predictive performance, which is what we care about a lot of the time in machine learning, with temperatures less than one. Which is annoying, because we're Bayesians and we think that the Bayesian posterior is optimal for decision-making under uncertainty.

So this is annoying, but at least in our experiments, we found this so-called cold posterior effect to be much more prominent under Gaussian approximations, which we only believe to be very crude approximations to the posterior anyway. And if we do more MCMC or deep ensemble stuff; we've got a paper we'll be able to put on arXiv shortly, which describes deep ensembles. In deep ensembles, you just run gradient descent in parallel

with different initializations and batch shuffling. Say you run 10 ensembles, 10 optimizations in parallel, then you've got 10 parameter configurations at the end: a Monte Carlo approximation to the posterior of size 10. And then we describe in the paper how to get this asymptotically unbiased property by using that temperature. Because, as we said earlier, SGMCMC becomes SGD at temperature zero. So you can reverse this

for deep ensembles: you add the noise, and then deep ensembles become asymptotically unbiased SGMCMC. And in those cases, where you have the non-Gaussian approximation, we found much less of the cold posterior effect. So maybe the cold posterior effect is a natural thing, because a tempered posterior is not really Bayes' theorem. It still needs to be better understood; at least in my head, I'm not fully clear on whether the cold posterior effect is something we should be surprised about.
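As a rough sketch of that deep ensemble recipe (not the Posteriors implementation; `make_model`, the loader, and all hyperparameters here are illustrative placeholders):

```python
import torch
import torch.nn.functional as F

def train_member(model, loader, epochs=10, lr=1e-3):
    # Plain SGD training; each ensemble member sees independently
    # shuffled batches.
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            F.cross_entropy(model(x), y).backward()
            opt.step()
    return model

def deep_ensemble(make_model, loader, n_members=10):
    # Each member starts from its own random initialization, giving a
    # crude Monte Carlo "posterior" of size n_members. Adding suitably
    # scaled Gaussian noise to each step would turn each member into an
    # SGMCMC chain, per the tempering trick described above.
    return [train_member(make_model(), loader) for _ in range(n_members)]

def predict(ensemble, x):
    # Average the predictive distributions over members.
    return torch.stack([m(x).softmax(-1) for m in ensemble]).mean(0)
```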

Okay, yeah. Yeah, me neither, if that makes you feel any better, because I just learned about it. So yeah, I don't have any strong opinion. Okay, I think we're getting clearer now on what Posteriors is, for the listeners. So then, one of the last questions about the algorithms underlying all of that: stochastic gradient MCMC. That's where I got confused.

I hear stochastic gradient, but no, it's SGMCMC, not SGD. So Posteriors really likes to use SGMCMC. Why would you do that, and not use MCMC, like the classic HMC from Stan or PyMC? Yeah, so I mean, it's not just SGMCMC. There's also variational inference, Laplace approximation, extended Kalman filter, and we're really excited to have more methods as well as we look to maintain and expand the library. Why would you use SGMCMC?

So yeah, I think we've already touched upon this. The thing is, if you've got loads of data, it's just going to be inefficient to sum over all of that data at every iteration of your MCMC algorithm, as Stan would do. And there's a mathematical reason why you can't just subsample in Stan: the Metropolis-Hastings ratio has this exponential of the log posterior.

But log space is the only place you can get the unbiased approximation, which is what you need if you did want to naively subsample. So you can't do the Metropolis-Hastings accept/reject. You have to use different tooling. And in its simplest terms, SGMCMC just omits it and runs a Langevin diffusion. So it just runs your Hamiltonian Monte Carlo without the accept/reject.
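In its simplest form, stochastic gradient Langevin dynamics, one update step is

$$\theta_{t+1} = \theta_{t} + \frac{\epsilon}{2}\, \widehat{\nabla_{\theta} \log p}(\theta_{t} \mid y) + \sqrt{\epsilon T}\, \xi_{t}, \qquad \xi_{t} \sim \mathcal{N}(0, I),$$

with a mini-batch gradient estimate in place of the full gradient and no accept/reject step; note the temperature $T$ scales the injected noise, with $T = 0$ recovering stochastic gradient descent.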

But there's more theory on top of this, and you need to control the discretization error and things like that. I won't go into the weeds of that. Okay. Yeah. Okay. And that's tied to mini-batching, basically. The power that SGMCMC gives you when you're in a high data regime is tied to the mini-batching, if I understand correctly. That's the difference between MCMC and SGMCMC. Okay, so that's the main difference. Okay. Yeah, stochastic gradient.

So you can't actually get the exact gradient, like you need in Hamiltonian Monte Carlo and for the Metropolis-Hastings step; you only get an unbiased approximation. And there's theory about this: sometimes you can invoke the central limit theorem, and then you've got a covariance attached to your gradients, and you can do nice theory and improve the algorithms like that. Okay. All clear now. All clear. Awesome.

Yeah. And I think that's the first time we've talked about that on the show, so it was definitely useful to be extra clear about it, so that listeners understand, and me, myself, so that I understand. Thanks a lot. In some settings it's actually much simpler, because you remove the tools that you have available to you by removing the accept/reject step. So it makes the implementation a bit simpler. But you kind of lose the theory with that.

And then a lot of the argument is that if you use a decreasing step size, then your noise from the mini-batch, your noise from the stochastic gradient, decreases as epsilon squared, which is faster. So if you decrease your step size and run it for infinite time, then you'll eventually just be running the continuous-time dynamics, which are exact and do have the right stationary distribution. So if you run it with decreasing step size, then you are asymptotically unbiased.

But running with decreasing step size is really annoying, because you then don't move as far. As we know from normal MCMC, we want to increase our step size and explore the posterior more. So there's lots of research to be done here. I hope and I feel that it's not the last time you'll talk about stochastic gradient MCMC on this podcast. Yeah, no. I mean, that sounds super interesting. I'm really interested also to really understand the difference between these algorithms.

Right now, that's really at the frontier of research. You not only have a lot of research on how to make HMC more efficient, but you have all these new approximate algorithms, as we said before: VI, Laplace approximation, stuff like that. But also now you have normalizing flows. We talked about that in episode 98 with Marylou Gabrié. Marylou Gabrié, actually, I don't know why I said the second part in Spanish.

Because my Spanish is really available in my brain right now. She's French, so that's Marylou Gabrié. Episode 98, it's in the show notes. Episode 107, I already mentioned it, with Marvin Schmitt, about amortized Bayesian inference. Actually, do you know about amortized Bayesian inference and normalizing flows? I know a bit about normalizing flows. Amortized Bayesian inference I would be less comfortable with. Okay. But I mean, if you could explain it.

Yeah, I haven't watched or listened to that episode. Yeah, I mean, we released it yesterday. I'm a bit disappointed, Sam, but that's fine. It's just one day, you know. If you listen to it just after the recording, I'll forgive you. That's okay. No, kidding aside, I'm actually curious to hear you speak about the difference between normalizing flows and SGMCMC. Can you talk a bit about that, if you're comfortable with it? I mean, I can't.

It's been a while since I've read about normalizing flows. When I did read about them, I understood them to be essentially a form of variational inference, where you define a more elaborate variational family, essentially through a triangular mapping. Someone might say, why can't you just use a neural network as your variational distribution? And it's not so easy, because you need to have this tractable form.

Hang on a second. Let me remember. But the thing is, with normalizing flows, you can get this because you can invert. That's it. They're invertible, right? Normalizing flows are invertible. So you can write the change of variables formula, and then you can fit these normalizing flows to a distribution, essentially just by maximum likelihood. Whereas SGMCMC doesn't need that.
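The change of variables identity being recalled here: for an invertible map $f_{\phi}$ pushing a simple base density $p_{Z}$ forward to the variational family,

$$\log q_{\phi}(\theta) = \log p_{Z}\big(f_{\phi}^{-1}(\theta)\big) + \log \left| \det \frac{\partial f_{\phi}^{-1}(\theta)}{\partial \theta} \right|,$$

which is what makes the density tractable enough to fit by maximum likelihood; the invertibility, plus a cheap Jacobian determinant (for example from triangular maps), is exactly what a generic neural network lacks.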

So in normalizing flows, you have to define your ansatz that will fit to your distribution. I think normalizing flows are really exciting and really interesting, but you have to specify your ansatz. So there's another specification on top: rather than just writing the log posterior, you then need to find an approximate ansatz which you think will fit the posterior, or the distribution you're targeting.

Whereas SGMCMC is just: log posterior, go. Which is sort of what we're trying to do with Posteriors, we're trying to automate, well, not automate, we're trying to research, of course. But normalizing flows might be, yeah, as I said, I think it's really interesting that you can get these more expressive variational families through triangular mappings. Yeah, super interesting.

And yeah, amortized Bayesian inference is related in the sense that you first fit a deep neural network on your model, and then once it's fit, you get posterior inference for free, basically. So that's quite different from what I understand SGMCMC to be. But that's also extremely interesting.

That's also why I'm hammering you with questions about the different use cases of SGMCMC, so that myself and listeners have a kind of tree in their head: okay, my use case is more appropriate for SGMCMC; or no, here I'd like to try amortized Bayesian inference; or, I know here I can just stick to plain vanilla HMC. I think that's very interesting. But thanks for that question, that was completely improvised.

I definitely appreciate you taking the time to rack your brain about the difference with normalizing flows. No, I'd love to talk more on that. I'd need to refresh myself. I've written down some notes on normalizing flows, and I was quite comfortable with them, but it's just been a while since I refreshed. So I would love to refresh, and then we can chat about them. Because I'd love to do a project on them, or I'd love to work on them, because I think that's a great way to fit a distribution to data, which is, after all, a lot of what we do.

Yeah. Yeah. So that makes me think we should probably do another episode about normalizing flows. So listeners, if there is a researcher you like who does a lot of normalizing flows and you think would be a good guest on the show, please reach out to me and I'll make that happen.

Now let's get you closer to home, Sam, and talk about Posteriors again. Basically, if I understood correctly, Posteriors aims to address uncertainty quantification in deep learning. Am I right there? And if that's the case, why is this particularly important for neural networks, and how does the package help in managing overconfidence in model predictions? Yeah, so that's our primary use case.

At Normal, it's to use Posteriors for approximate Bayes, getting as close to Bayes as we can, which is probably not that close, but still getting somewhere on the way to the Bayesian posterior in big deep learning models. But we've built Posteriors to be as modular and general as possible. So as I said, if you have a classical Bayesian model, you can write it down in Pyro, and if you've got loads of data, then okay, go ahead: Posteriors should be well suited to that.

In terms of what advantages we want to see from uncertainty quantification, or this approximate Bayesian inference, in deep learning models, there are three key things that we've distilled it down to. So yeah, you mentioned confidence in out-of-distribution predictions: we should be able to improve our performance in predicting on inputs that we haven't seen in the training set. I'll talk about that after this.

The second one is continual learning, where we think that if you can do Bayes' theorem exactly, you have your prior, you get some data, you have the likelihood, you get a posterior; then you get some more data, and your posterior becomes your prior and you do the update. And you can just write it like that, if you can do Bayes' theorem exactly.

And then, yeah, this is, you can extend it even further and then you have, with some sort of evolution along your parameters, then you have a state space model, and then the exact setting linear Gaussian, you've got a count filter. So continual learning is, in this case, Bayes' theorem does that exactly. And in continual learning research in machine learning settings, they have this term of avoiding catastrophic forgetting.

If you just continue to do gradient descent, there's no memory there, so apart from the initialization, you would just forget what you've done previously, and there's lots of evidence for this. Whereas Bayes' theorem is completely exchangeable in the order of the data that you see. So if you're doing Bayes' theorem exactly, there's no forgetting; you're only limited by the capacity of the model.
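The continual learning update described here is just recursive Bayes:

$$p(\theta \mid y_{1:t}) \;\propto\; p(y_{t} \mid \theta)\; p(\theta \mid y_{1:t-1}),$$

yesterday's posterior serving as today's prior; and since the result is invariant to the order of the $y_{t}$, exact Bayesian updating has no catastrophic forgetting by construction.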

So that's where we see Bayes solving continual learning, but as I said, you can't do Bayes' theorem exactly in a billion-dimensional model. And then the last one is what we'll call decomposition of uncertainty in your predictions.

So if you just have a gradient descent model and you're predicting someone's reviews, and you have to predict the stars, it will just give you, as you said, your softmax; it'll just give you this distribution over the stars, and that's it. But what you really want is some indication of out-of-distribution detection, right? You want to know, okay, am I confident in my prediction?

And you might get a review that's like, the food was terrible but the service was amazing, or the other way around, the service was amazing but the food was terrible. And let's say we have a perfect model of how people review things. We'd still have quite a lot of uncertainty on that review, because we don't know how the reviewer values those different things. So we might have a completely uniform distribution over the stars for that review.

But we'd be confident in that distribution. And what Bayes gives you is the ability to do a sort of second-order uncertainty quantification: if you have this distribution over parameters, and you have a distribution over logits, the predictions, at the end, you can split it into what information theory calls aleatoric and epistemic uncertainty.

Aleatoric uncertainty, or data uncertainty, is what I just described there, which is natural uncertainty in the model and the data generating process. Epistemic uncertainty is uncertainty that would be removed in the infinite data limit. So that would be where the model doesn't know. And this is really important for us to quantify. Okay, I rambled a bit there.
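The split being referred to is the standard information-theoretic decomposition of predictive entropy:

$$\underbrace{\mathcal{H}\Big[\mathbb{E}_{p(\theta \mid \mathcal{D})}\, p(y \mid x, \theta)\Big]}_{\text{total}} = \underbrace{\mathbb{E}_{p(\theta \mid \mathcal{D})}\, \mathcal{H}\big[p(y \mid x, \theta)\big]}_{\text{aleatoric}} + \underbrace{\mathcal{I}(y; \theta \mid x, \mathcal{D})}_{\text{epistemic}},$$

where the mutual information term is the part that vanishes in the infinite data limit, the "model doesn't know" uncertainty.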

I can, in like 30 seconds, elaborate on the point you specifically mentioned about out-of-distribution performance, and improving performance out of distribution. I think that's quite compelling from a Bayesian point of view, because what Bayes says in a supervised learning setting is: gradient descent just finds one parameter configuration that's plausible given the training data.

Bayes' theorem says: I find the whole distribution of parameter configurations that are plausible given the data. And then when we make predictions, we average over those. So it's perfectly natural to think that a single configuration might overfit, and might just be very confident in its predictions when it sees out-of-distribution data.

It doesn't necessarily fix a bad model, but it should be more honest to the model and the data generating process you've specified, if you average over plausible model configurations under the training data when you do your testing. So that's, to me, quite a compelling argument for improving performance on out-of-distribution predictions, like the accuracy of them.

And there's a fair bit of empirical evidence for this, with the caveat again being that the Bayesian posterior in high-dimensional machine learning models is pretty hard to approximate: cold posterior effect, caveats, these sorts of things. Okay, yeah, I see. Yeah, super interesting. So now I understand better what you have on the Posteriors website about the different kinds of uncertainty. So that's definitely something I recommend listeners give a read.

I put that in the show notes, both your blog post introducing Posteriors and the docs for Posteriors, because I think they make that clear, combined with your explanation right now. Yeah. And something I was also wondering: if I understood correctly, the package is built on top of PyTorch, right? Yeah, that's correct. Yeah. Okay. And also, did I understand correctly that you can integrate Posteriors with pre-trained LLMs like Llama 2 and Mistral, and you do that with the Hugging Face Transformers package?

So, yeah, I mean, Posteriors is open source. We fully support the open source community for machine learning, for statistics. And in terms of, yeah, we're sort of in the fine-tuning era: there are these open source models and you can't get away from them. We have Llama 2, Llama 3, Mistral. And basically we want to harness this power, right?

But as I mentioned previously, there are some issues that we'd like to remedy with Bayesian techniques. So, the majority of these open source models are built in PyTorch. I'm also a big JAX fan, I use JAX a lot, so I was very happy to see and work with the torch.func sub-library, which basically means you can use Llama 3 or Mistral with PyTorch but writing functional code. So that's what we've done with Posteriors.
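As a hedged sketch of that functional pattern (the model name, prior, and loss rescaling below are illustrative assumptions, not Posteriors' prescribed usage):

```python
import torch
from torch.func import functional_call
from transformers import AutoModelForCausalLM

# Any PyTorch model hosted on Hugging Face works the same way; Mistral is
# just an example here.
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
params = {k: v.detach() for k, v in model.named_parameters()}

def log_posterior(params, batch):
    # Stateless forward pass: the same module, but with parameters passed
    # in explicitly, so inference methods can treat them as the random
    # variable.
    out = functional_call(model, params, (), {"input_ids": batch, "labels": batch})
    # Hugging Face returns a mean cross-entropy; rescaling to a sum is an
    # assumption about the normalization wanted for a log-likelihood.
    log_lik = -out.loss * batch.numel()
    # Illustrative isotropic Gaussian prior over all parameters.
    log_prior = sum(-0.5 * (p ** 2).sum() for p in params.values())
    return log_lik + log_prior
```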

So, yeah, Hugging Face Transformers: you can download the models, that's where they're all hosted and how you access them. But then what you get is just a PyTorch model. And then you throw that into Posteriors, and it all works nicely with the Posteriors updates. Or you write your own new updates in the Posteriors framework and you can use those as well, still with Llama 3 or Mistral. Yeah. Okay. Nice. And so what does that mean concretely for users?

That means you can use these pre-trained LLMs with Posteriors, and that means adding a layer of uncertainty quantification on top of those models? Yeah. So, I mean, Bayes' theorem is a training theorem, so you need data as well.

So you take your pre-trained model, which is, yeah, a transformer, or it could be another type of model, it could be an image model or something like that, and then you give it some new data, which we would call fine-tuning, and then you use Posteriors to combine the two, and then you have your new model at the end of the day, which has uncertainty quantification. It's difficult; as I said, we're sort of in this fine-tuning era of open-source large language models.

There's still lots of research to do here, and it's different to our classical Bayesian regime, where there's only one source of data and it's what we give the model. In this case, there are two sources of data, because you have whatever Llama 3 saw in its original training data, and then it has your own data. Can we hope to get uncertainty quantification on the data they used in the original training?

Probably not, but we might be able to get uncertainty quantification and improved predictions based on the data that we've provided. So there's lots for us to try out here and learn, because we are still learning on this in terms of the fine-tuning. But yeah, Posteriors is there to make these sorts of questions as easy as possible to ask and answer. Okay, fantastic. Yeah, that's so exciting.

It's just a bit frustrating to me, because I'm like, I'd love to try that, and learn on that, and contribute to that kind of package. At the same time, I have to work, I have to do the podcast, and I have all the packages I'm already contributing to. So I'm like, my god, too many choices. No, come on Alex, we're going to see an Alex pull request soon enough.

Actually, does this ability to use these pre-trained transformer models help facilitate the adoption of new algorithms in Posteriors? Because if I understand correctly, you can support new algorithms pretty easily, and you can support arbitrary likelihoods. How do you do that? I wouldn't say that the existence of the pre-trained models necessarily allows us to support new algorithms.

I feel like we've built Posteriors to be suitably general and suitably modular that it's kind of agnostic to your model choice and your log posterior choice, in terms of arbitrary likelihoods. But yeah, the arbitrary likelihood point is relevant, because a lot of machine learning essentially just boils down to classification or regression. And that is true.

And because of that, a lot of machine learning packages will essentially constrain you to classification or regression. At the end, you either have your softmax or you have your mean squared error. Yeah, softmax cross-entropy or mean squared error. In Posteriors, we haven't done that. We're more faithful to the Bayesian setting, where you just write down your log posterior, and you can write down whatever you want.

And this allows you greater flexibility in the case you did want to try out a different likelihood, or even in simple cases: things are just more sophisticated than plain classification or regression a lot of the time, like in sequence generation, where you have the sequence and then the cross-entropy over all of it. It just allows you to be more flexible and write the code how you want. And there are additional things to be taken into account.

Like sometimes, if you were doing a regression, you might have knowledge of the noise variance, the observation noise variance. If we don't constrain things like this, it's just much easier to write your code, much cleaner code, than if we did. And it's also future-proofing: we don't know what's going to be happening going forward.

In multimodal models, we may see text and images together, in which case, yeah, we will support that. You have to supply the compute and the data, which might be the harder thing, but we'll support those likelihoods. Okay, I see. I see. Yeah, that's very, very interesting. And that's related to the fact that, I think I've read in your blog post or on the website, you say that Posteriors is swappable. What does that mean? And how does that flexibility benefit users?

Yeah. So the point of swappable, when I say that, is that you can change between methods if you want to. As I said, Posteriors is a research toolbox, and it's for us to investigate which inference method is appropriate in different settings, which might be different if you care about decomposing predictive uncertainty, and different if you care about avoiding catastrophic forgetting in your continual learning.

So the way it's written, you can just swap: you can go from SGMCMC to the Laplace approximation, or you can go to VI, just by changing one line of code. And the way it works is you have your build: you have your transform = posteriors.inference_method.build, and then any configuration arguments, step size, things like this, which are algorithm specific. And after that, it's all unified.

So you just have your init on the parameters that you want to do Bayes on. And then you iterate through your data loader, you iterate through your data, and it just updates based on the batch. And the batch can be very general. So that's what it means: you can just change one line of code to swap between variational inference and SGMCMC, or the extended Kalman filter, or any and all of the new methods that the listeners are going to add in the future. Heh. Okay. Okay. I see.
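Concretely, the build/init/update pattern described here looks something like the following, reusing the `log_posterior` and `params` from the sketch above; the specific method paths are my reading of the docs and may differ between versions:

```python
import posteriors

# Build: all algorithm-specific configuration happens here and only here.
transform = posteriors.sgmcmc.sghmc.build(log_posterior, lr=1e-2)
# Swapping the inference method is the advertised one-line change, e.g.:
# transform = posteriors.laplace.diag_fisher.build(log_posterior)

# After build, every method shares the same interface.
state = transform.init(params)
for batch in dataloader:
    state = transform.update(state, batch)
```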

And so I have so many more questions for you on Posteriors, but let's start to wrap that up, because I also want to ask you about another project you're working on. So maybe to close things out on Posteriors: what are the future plans, and are there any upcoming features or integrations that you can share with us? So we're quite happy with the framework at the moment.

There's a list of GitHub issues with lots of little tweaks that we want to go through, which are mostly, and excitingly, about adding new methods and new applications. So that's really what we're excited about now: actually using it in the wild and hopefully experimenting with all these questions that we've discussed. Yeah, like how does it make sense, and how do we get the benefits of true Bayesian inference on fine-tuning, or on large models or large data.

And so yeah, we're really excited to add more methods. So if listeners have mini-batch, big-data Bayesian methods that they want to try out with a large model, then we'll hopefully be accepting those. I do promote generality, and doing things in a way that is flexible. We want to add methods that somehow feel natural, and one way is to extend and compose with other methods.

So it might be that if we've got some very complicated last-layer method that works just for classification, we're probably not going to add it. It has to be methods that stick within the Posteriors framework, which is this arbitrary-likelihood, swappable Bayesian computation. Okay. Yeah, that makes sense, because you have that vision of wanting it to be a research tool, basically. So

So Yeah, that makes sense to keep that under control, let's say. Something I want to ask you in the last few minutes of the show is about thermodynamic compute. I've seen you, you are working on that. And you've told me you're working on that. So yeah, I don't know anything about that. So can you like, what's that about? Yeah, so I mean, this is yeah, this is something that's very normal, normal computing. And it's like, It's something that we have. Yeah, we have this hardware team.

It's a full-stack AI company. On the Posteriors side, we look at how we can bring in principled Bayesian uncertainty quantification and help solve the issues with machine learning pipelines, like we've already discussed. And on the other side, there are lots of parts to this.

Traditional MCMC is difficult sometimes because, well, it's essentially about simulating SDEs, and that's what the thermodynamic hardware does: it simulates SDEs. Normally, you have this real pain with the step size: as the dimension grows, the steps get really small. And SDEs, where do we see SDEs?

You see SDEs in physics all the time, and physics is real, so we can use physics. It's about building physical hardware, analog hardware, that hopefully evolves as an SDE, and then we can harness those SDEs by encoding things like currents and voltages. I'm not a physicist, so I don't know exactly how it works.

But I'm always reassured, when I speak to the hardware team, by how simply they talk about these things. It's like, yeah, we can just stick some resistors and capacitors on a chip, and then it'll do this SDE. And then we want to use those SDEs for scientific computation, with a real focus on statistics and machine learning. So yeah, we want to be able to do HMC on device, on an analog device.

The first step is to do the linear case, so we'll have a Gaussian posterior, or a linear drift. This is an Ornstein-Uhlenbeck process, and we've developed hardware to do this. And it turns out that an Ornstein-Uhlenbeck process, because it has a Gaussian stationary distribution, lets you input the precision matrix and read out the covariance matrix, and that's matrix inversion. Your physical device just does this.
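As a digital stand-in for what the analog device would do physically, here is a toy simulation of that idea. The specific SDE parameterization (drift -Ax, diffusion sqrt(2) dW, which gives stationary covariance A^{-1}) is a standard textbook form I'm assuming for illustration, not a description of Normal Computing's actual chip:

```python
import torch

torch.manual_seed(0)
d = 4
M = torch.randn(d, d)
A = M @ M.T / d + torch.eye(d)  # a symmetric positive-definite "precision" matrix

# Euler-Maruyama simulation of dX = -A X dt + sqrt(2) dW.
# Its stationary distribution is N(0, A^{-1}), so the long-run sample
# covariance of the trajectory "computes" the matrix inverse of A.
dt, n_steps, n_burn = 1e-3, 200_000, 20_000
x = torch.zeros(d)
samples = []
for step in range(n_steps):
    x = x - (A @ x) * dt + (2 * dt) ** 0.5 * torch.randn(d)
    if step >= n_burn:  # discard the transient before stationarity
        samples.append(x)

S = torch.stack(samples)
est_inverse = torch.cov(S.T)  # empirical covariance of the trajectory
print(torch.linalg.norm(est_inverse - torch.linalg.inv(A)))  # should be small
```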

And because it's an SDE, it has noise, it's kind of noise-aware, which is different from classical analog computation, which is really old and has historically been plagued by noise. There's all this noise in physics, and because we're doing SDEs, we want the noise. So yeah, that's the whole idea. It's obviously very young, but it's fun stuff. Yeah. So that's basically to... accelerate computing?

That's hardware first, so that computing is accelerated? I mean, it's a baby field, so we're trying to accelerate different components. What we worked out is that the simplest thermodynamic chip we can build is this linear chip with the Ornstein-Uhlenbeck process. And that gives asymptotic speed-ups, with some error, for linear algebra routines, so inverting a matrix or solving a linear system. That's awesome.

In this case, it would speed up a certain component, but that could be useful in a Laplace approximation or these sorts of things in machine learning too. Okay, that must be very fun to work on. Do you have any writing about that that we can put in the show notes? Because I think it'd be super interesting for listeners. Yeah, yeah. The Normal Computing Scholar page has a list of papers, but we also have more accessible blog posts, which I'll make sure to send for the show notes.

Yeah, please do, because I think it's super interesting. And when you have something to present on that, feel free to reach out; I think it'd be fun to do an episode about that, honestly. That'd be great. Yeah. So, maybe one more question before asking you the final two. Let's zoom out and be way less technical. We've been very technical through the whole episode, which I love.

But I'm thinking, if you had any advice to give to aspiring developers interested in contributing to open-source projects like posteriors, what would it be? Okay, yeah, I don't feel like I'm necessarily the best placed to say all this, but the most important thing is just to go for it: get stuck in, get into the weeds of these libraries and see what's there.

There are loads of people building such cool stuff in the open-source ecosystem, and honestly it's really fun and rewarding to get involved. So just go for it, you'll learn so much along the way. For something more tangible: I find that when I'm stuck, when I don't understand something in code or mathematics, I often struggle to find it in papers per se.

And I find textbooks, I love textbooks, a real source of gold for this, because they actually go into the depths of explaining things, without the sort of horse-in-the-race style of writing you often find in papers. So yeah, get stuck in, check textbooks if you get lost or don't understand something. Or just ask as well. Open source is all about asking and communicating and bouncing ideas. Yeah, for sure. That's usually what I do.

I ask a lot, and I usually end up surrounding myself with people way smarter than me. And that's exactly what you want, that's exactly how I learn. Yeah, on textbooks, I'd say I kind of find the writing boring most of the time, it depends on the textbook. And also, they're expensive. So that's kind of the problem with textbooks, I would say. I mean, you can often get them as PDFs, but I just hate reading PDFs on my computer.

So, you know, I want the book object, or to have it on Kindle or something like that. But that doesn't really exist yet.

So that could be something some publishers solve someday; that'd be cool, I'd love that. Awesome, Sam, that was great, thank you so much. We've covered so many topics and my brain is burning, so that's a very good sign: I've learned a lot, and I'm sure our listeners did too. Of course, before letting you go, I'm going to ask you the last two questions I ask every guest at the end of the show. So, one: if you had unlimited time and resources, which problem would you try to solve?

I want to decouple the model specification, the data-generating process, how you go from the things you don't know to the data you do have, which is your freedom as a data modeler, from the inference and the mathematical computation, that is, the way you do your approximate Bayesian inference. You want to decouple those, and you want to make it as easy as possible. Ideally, we just want to be doing the first one.

We just want to be doing the model specification. Stan and PyMC do this really well: you write down your model, we'll handle the rest. And that's kind of the dream we have as Bayesians, or as Bayesian software developers. And with posteriors we're trying to move towards this for big machine learning models, so bigger models, bigger data settings. So that's kind of the dream there.
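For listeners who haven't seen it, this is what "write down your model, we'll handle the rest" looks like in PyMC; standard usage with made-up data, not code discussed in the episode:

```python
import pymc as pm

# Model specification is the only thing the user writes...
with pm.Model():
    mu = pm.Normal("mu", 0, 1)                        # prior
    pm.Normal("y", mu, 1, observed=[0.2, -0.4, 1.1])  # likelihood + data
    # ...and inference is fully delegated to the library.
    idata = pm.sample()
```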

But then, what does machine learning have that's different from statistics in that setting? Well, machine learning models are less interesting than classical Bayesian models, but the thing is they're more transferable, right? It's just a neural network, which we believe will solve a whole suite of tasks.

So perhaps in the machine learning setting, where we decouple modeling, inference, and data, you kind of want to remove the modeling step as well. You want these general-purpose foundation models, you could say. So really you want to let the user focus: we're handling the inference, we're also handling the model. Let the user just give it the data and say, okay, let's use this data to predict other things, and let the user handle that part.

So that's a real unlimited-time-and-resources answer; you'd need plenty of resources to do that. But yeah, that's Sam's May 2024 answer. Yeah, that sounds amazing. I agree with that, that's a fantastic goal. And that reminds me, that's also why I really love what you guys are doing with posteriors: now that we're starting to be able to get there, making Bayesian inference really scalable to really big data and big models.

I'm super enthusiastic about that; it would be just fantastic. So thank you so much for taking the time to do that, guys. Yeah, we're doing it, we're gonna get there. Yeah, I love that. And second question: if you could have dinner with any great scientific mind, dead, alive, or fictional, who would it be? Yeah, I was a bit intimidated by this question, you know, you ask everyone, and again, it's a great question. But then I thought about it for a little bit, and it wasn't too hard for me.

I think David MacKay is someone who, I mean, did amazing work. David MacKay was doing Bayesian neural networks in 1992, which is crazy, before I was born. Anyway, Bayesian neural networks in 1992, and then I've just been going through his textbook, as I said, I love textbooks, his textbook on information theory and Bayesian inference; he was a Bayesian through and through.

And there's something he says right at the start of the textbook: one of the themes of this book is that data compression and data modeling are one and the same. And that's just really beautiful. And he talks about stream codes, in a very information-theory setting, but it's just an autoregressive prediction model, just like our language models.

So it's someone with the ability to distill information, to help unification, and to be so ahead of their time. And then additionally, with a sort of groundbreaking book on sustainable energy, he was also tackling one of the greatest challenges we have at the moment. The sustainable energy book is really wonderful, one of my favorite books so far. Nice. Yeah, definitely put that in the show notes. Yes, definitely. Yeah, I'd like to keep that one to read.

So yeah, please also put that in the show notes, that's going to be fantastic. Great. Well, I think we can call it a show. That was fantastic. Thank you so much, Sam. I learned so much, and now I feel like I have to go read and learn about so many things. And I can definitely tell that you're extremely passionate about what you're doing. So yeah, thank you so much for taking the time and being on this show. No, thank you very much, I had a lot of fun.

Yeah, thank you for being party to my rantings, I need that sometimes. Yeah, that's what the show is about. My girlfriend is extremely happy that I have this show to rant about Bayesian stats and any nerdy stuff. Yeah, it's so true. Well, Sam, you're welcome back anytime you need to do some nerdy rant. Thank you, I'm sure I'll be... This has been another episode of Learning Bayesian Statistics. Be sure to rate, review, and follow the show on your favorite podcatcher, and

visit learnbayestats.com for more resources about today's topics, as well as access to more episodes to help you reach a true Bayesian state of mind. That's learnbayestats.com. Our theme music is Good Bayesian by Baba Brinkman, feat. MC Lars and Mega Ran. Check out his awesome work at bababrinkman.com. I'm your host, Alex Andorra. You can follow me on Twitter at Alex underscore Andorra, like the country. You can support the show and unlock exclusive benefits by visiting Patreon

.com slash LearnBayesStats. Thank you so much for listening and for your support. You're truly a good Bayesian. Change your predictions after taking information in. And if you're thinking I'll be less than amazing, let's adjust those expectations. Let me show you how to be a good Bayesian. Change calculations after taking fresh data in. Those predictions that your brain is making, let's get them on a solid foundation.
