
#107 Amortized Bayesian Inference with Deep Neural Networks, with Marvin Schmitt

May 29, 2024 • 1 hr 22 min • Season 1 • Ep. 107

Episode description

Proudly sponsored by PyMC Labs, the Bayesian Consultancy. Book a call, or get in touch!


In this episode, Marvin Schmitt introduces the concept of amortized Bayesian inference, where the upfront training phase of a neural network is followed by fast posterior inference.

Marvin will guide us through this new concept, discussing his work in probabilistic machine learning and uncertainty quantification, using Bayesian inference with deep neural networks. 

He also introduces BayesFlow, a Python library for amortized Bayesian workflows, and discusses its use cases in various fields, while also touching on the concept of deep fusion and its relation to multimodal simulation-based inference.

A PhD student in computer science at the University of Stuttgart, Marvin is supervised by two LBS guests you surely know — Paul Bürkner and Aki Vehtari. Marvin’s research combines deep learning and statistics, to make Bayesian inference fast and trustworthy. 

In his free time, Marvin enjoys board games and is a passionate guitar player.

Our theme music is « Good Bayesian », by Baba Brinkman (feat MC Lars and Mega Ran). Check out his awesome work at https://bababrinkman.com/ !

Thank you to my Patrons for making this episode possible!

Yusuke Saito, Avi Bryant, Ero Carrera, Giuliano Cruz, Tim Gasser, James Wade, Tradd Salvo, William Benton, James Ahloy, Robin Taylor, Chad Scherrer, Zwelithini Tunyiswa, Bertrand Wilden, James Thompson, Stephen Oates, Gian Luca Di Tanna, Jack Wells, Matthew Maldonado, Ian Costley, Ally Salim, Larry Gill, Ian Moran, Paul Oreto, Colin Caprani, Colin Carroll, Nathaniel Burbank, Michael Osthege, Rémi Louf, Clive Edelsten, Henri Wallen, Hugo Botha, Vinh Nguyen, Marcin Elantkowski, Adam C. Smith, Will Kurt, Andrew Moskowitz, Hector Munoz, Marco Gorelli, Simon Kessell, Bradley Rode, Patrick Kelley, Rick Anderson, Casper de Bruin, Philippe Labonde, Michael Hankin, Cameron Smith, Tomáš Frýda, Ryan Wesslen, Andreas Netti, Riley King, Yoshiyuki Hamajima, Sven De Maeyer, Michael DeCrescenzo, Fergal M, Mason Yahr, Naoya Kanai, Steven Rowland, Aubrey Clayton, Jeannine Sue, Omri Har Shemesh, Scott Anthony Robson, Robert Yolken, Or Duek, Pavel Dusek, Paul Cox, Andreas Kröpelin, Raphaël R, Nicolas Rode, Gabriel Stechschulte, Arkady, Kurt TeKolste, Gergely Juhasz, Marcus Nölke, Maggi Mackintosh, Grant Pezzolesi, Avram Aelony, Joshua Meehl, Javier Sabio, Kristian Higgins, Alex Jones, Gregorio Aguilar, Matt Rosinski, Bart Trudeau, Luis Fonseca, Dante Gates, Matt Niccolls, Maksim Kuznecov, Michael Thomas, Luke Gorrie, Cory Kiser, Julio, Edvin Saveljev, Frederick Ayala, Jeffrey Powell, Gal Kampel, Adan Romero, Will Geary and Blake Walters.

Visit https://www.patreon.com/learnbayesstats to unlock exclusive Bayesian swag ;)

Takeaways:

  • Amortized Bayesian inference...

Transcript

In this episode, Marvin Schmitt introduces the concept of amortized Bayesian inference, where the upfront training phase of a neural network is followed by fast posterior inference. Marvin will guide us through this new concept, discussing his work in probabilistic machine learning and uncertainty quantification using Bayesian inference with deep neural networks.

He also introduces BayesFlow, a Python library for amortized Bayesian workflows, and discusses its use cases in various fields while also touching on the concept of deep fusion and its relation to multimodal simulation-based inference. Yeah, that is a very deep episode and also a fascinating one. I've been personally diving much more into amortized Bayesian inference with BayesFlow since the folks there have been kind enough

to invite me to the team, and I can tell you, this is super promising technology. A PhD student in computer science at the University of Stuttgart, Marvin is supervised actually by two LBS guests you surely know, Paul Bürkner and Aki Vehtari. Marvin's research combines deep learning and statistics to make Bayesian inference fast and trustworthy. In his free time, Marvin enjoys board games and is a passionate guitar player. This is Learning Bayesian Statistics, episode 107, recorded April 3, 2024.

Welcome to Learning Bayesian Statistics, a podcast about Bayesian inference, the methods, the projects, and the people who make it possible. I'm your host, Alex Andorra. You can follow me on Twitter at alex_andorra, like the country, for any info about the show. LearnBayesStats.com is Laplace to be. Show notes, becoming a corporate sponsor, unlocking Bayesian merch, supporting the show on Patreon, everything is in there. That's LearnBayesStats.com.

If you're interested in one-on-one mentorship, online courses, or statistical consulting, feel free to reach out and book a call at topmate.io/alex_andorra. See you around, folks, and best Bayesian wishes to you all. Today, I want to thank the fantastic Adam Romero, Will Geary, and Blake Walters for supporting the show on Patreon. Your support is truly invaluable and literally makes this show possible. I can't wait to talk with you guys in the Slack channel.

Second, the first part of our modeling webinar series on Gaussian processes is out for everyone. So if you want to see how to use the new HSGP approximation in PyMC, head over to the LBS YouTube channel and you'll see Juan Orduz, a fellow PyMC core dev and mathematician, explain how to do fast and efficient Gaussian processes in PyMC. I'm actually working on the next part in this series as we speak, so stay tuned for more and follow the LBS YouTube channel if you don't want to miss it.

Okay, back to the show now. Marvin Schmitt, willkommen nach Learning Bayesian Statistics. Thanks Alex, thanks for having me. Actually my German is very rusty, do you say nach or zu? Well, welcome to Learning Bayesian Statistics. Maybe willkommen im Podcast? Nah. Obviously, obviously like it was a third hidden option. Damn. It's a secret third thing, right? Yeah, always in Germany. It's always that. Man, damn. Well, that's okay.

I got embarrassed in front of the world, but I'm used to that in each episode. So thanks a lot for taking the time, Marvin. Thanks a lot to Matt Rosinski actually for recommending to do an episode with you. Matt was kind enough to take some of his time to write to me and put me in contact with you. I think you guys met in Australia at a very fun conference, Bayes on the Beach. I think it happens every two years. Definitely, when I go there in two years, I'll do a live episode there.

Definitely that's a... That's a project I wanted to do this year, but that didn't go well with my traveling dates. So in two years, definitely going to try to do that. So yeah, listeners and Marvin, you can hold me accountable on that promise. Absolutely. We will. So Marvin, before we talk a bit more about what you're a specialist in and also what you presented in Australia, can you tell us what you're doing nowadays and also how you ended up working on this? Yeah, of course.

So these days, I'm mostly doing methods development. So broadly in probabilistic machine learning, I care a lot about uncertainty quantification. And so essentially, I'm doing Bayesian inference with deep neural networks.

So taking Bayesian inference, which is notoriously slow at times, which might be a bottleneck, and then using generative neural networks to speed up this process, but still maintaining all the explainability, all these nice benefits that we have from using Bayesian inference. I have a background in both psychology and computer science. That's also how I ended up in Bayesian inference.

Because during my psychology studies, I took a few statistics courses, then started as a statistics tutor, mainly doing frequentist statistics. And then I took a seminar on Bayesian statistics in Heidelberg in Germany, and it was the hardest seminar that I ever took. It was super hard. We read like papers every single week. Everyone had to prepare every single paper for every single week. And then at the start of each session, the professor would just shuffle and randomly pick someone to present.

Oh my God. That was tough, but somehow, I don't know, it stuck with me. And I had like this aha moment where I felt like, okay, all this statistics stuff that I've been doing before was more of, you know, following a recipe, which is very strict. But then this like holistic Bayesian probabilistic take just gave me a much broader overview of statistics in general. Somehow I followed the path. Yeah. I'm curious what that...

So what does that mean, to do Bayesian stats on deep neural networks concretely? What is the thing you would do if you had to do that? Let's say, does that mean you mainly... you develop the deep neural network and then you add some Bayesian layer on that, or you have to have the Bayesian framework from the beginning? How does that work? Yeah, that's a great question.

And in fact, that's a common point of confusion there as well, because Bayesian inference is just like a general, almost philosophical framework for reasoning about uncertainty. So you have some latent quantities, call them parameters, whatever, some latent unknowns. And you want to do inference on them. You want to know what these latent quantities are, but all you have are actual observables. And you want to know how these are related to each other.

And so with Bayesian neural networks, for instance, these parameters would be the neural network weights. And so you want full Bayesian inference on the neural network weights, and fitting normal neural networks is already hard. Like a posterior distribution? Exactly. Over these neural network weights. Exactly. So that's one approach of doing Bayesian deep learning, but that's not what I'm currently doing. Instead, I'm coming from the Bayesian side.

So we have like a normal Bayesian model, which has statistical parameters. So you can imagine it like a mechanistic model or like a simulation program. And we want to estimate these scientific parameters.

So for example, if you have a cognitive decision-making task from the cognitive sciences, and these parameters might be something like the non-decision time, the actual motor reaction time that you need to move your muscles, and some information uptake rates, some bias and all these things that researchers are actually interested in.

And usually you would then formulate your model in, for example, PyMC or Stan or however you want to formulate your statistical model and then run MCMC for parameter inference. And now where the neural networks come in in my research is that we replace MCMC with a neural network. So we still have our Bayesian model. But we don't use MCMC for posterior inference. Instead, we use a neural network just for posterior inference. And this neural network is trained by maximum likelihood.

So the neural network itself, the weights there are not probabilistic. There are no posterior distributions over the weights. But we just want to somehow model the actual posterior distributions of our statistical model parameters using a neural network. So the neural net... okay, I see. That's quite new to me. So I'm going to rephrase that and see how much I understood. So that means the deep neural network is already trained beforehand? No, we have to train it.

And that's the cool part about this. OK, so you train it at the same time you're also trying to infer the underlying parameters of your model? And that's the cool part now. Because in MCMC, you would do both at the same time, right? You have your fixed model that you write down in PyMC or Stan, and then you have your one observed data set, and you want to fit your model to the data set.

And so, you know, you do, for example, your Hamiltonian Monte Carlo algorithm to, you know, traverse your parameter space and then do the sampling. So you couple your approximating phase and your inference phase. Like you learn about the posterior distribution based on your data set. And then you also want to generate posterior samples while you're exploring this parameter space. And in the line of work that I'm doing, which we call amortized Bayesian inference, we decouple those two phases.

So the first phase is actually training those neural networks. And that's the hard task. And then you essentially take your Bayesian model, generate a lot of training data from the model, because you can just run prior predictive samples. So generate prior predictive samples. And those are your training data for the neural network. And use the neural network to essentially learn a surrogate for the posterior distribution.

So for each data set that you have, you want to take those as conditions and then have a generative neural network to learn somehow how these data and the parameters are related to each other. And this upfront training phase takes quite some time and usually takes longer than the equivalent MCMC would take, given that you can run MCMC. Now, the cool thing is, as you said, when your neural network is trained, then the posterior inference is super fast.

Then if you want to generate posterior samples, there's no approximation anymore because you've already done all the approximation. So now you're really just doing sampling. That means just generating some random numbers in some latent space and having one pass through the neural network, which is essentially just a series of matrix multiplications.

So once you've done this hard part and trained your generative neural network, then actually doing the posterior sampling takes like a fraction of a second for 10,000 posterior samples. Okay, yeah, that's really cool. And how generalizable is your deep neural network then? Do you have like, is that, because I can see the really cool thing to have a neural network that's customized to each of your models. That's really cool.

But at the same time, as you were saying, that's really expensive to train a neural network each time you have to sample a model. And so I was thinking, OK, so then maybe what you want is to have generalized categories of deep neural networks. So that would probably be another option. But let's say I have a deep neural network for linear regressions. Whether they are generalized or just with a plain normal likelihood, you would use that deep neural network for linear regressions.

And then the inference is super fast, because you only have to train the neural network once, and then inference, posterior inference on the linear regression parameters themselves, is super fast. So yeah, like that's a long question, but did you get what I'm asking? Yeah, absolutely. So if I get your question right, now you're asking like, if you don't want to run linear regression, but want to run some slightly different model, can I still use my pre-trained neural network to do that?

Yes, exactly. And also, yeah, like in general, how does that work? Like, how are you thinking about that? Are there already some best practices or is it like really, for now, really cutting-edge research and all the questions are in the air? Yeah. So first of all, the general use case for this type of amortized Bayesian inference is usually when your model is fixed, but you have many new datasets. So assume you have some quite complex model where MCMC would take a few minutes to run.

And so instead, for one fixed data set that you actually want to sample from, now instead of running MCMC on it, you say, okay, I'm going to train this neural network. So this won't yet be worth it for just one data set. Now the cool thing is, if you want to keep your actual model, so whatever you write down in PyMC or Stan, we want to keep that fixed, but now plug in different data sets. That's where amortized inference really shines.

So for instance, there was this one huge analysis in the UK where they had like intelligence study data from more than 1 million participants. And so for each of those participants, they again had a set of observations. And so for each of those 1 million participants, they want to perform posterior inference. It means if you want to do this with something like MCMC or anything non-amortized, you would need to fit one million models.

So you might argue now, okay, but you can parallelize this across like a thousand cores, but still that's, that's a lot. That's a lot of compute. Now the cool thing is the model was the same every single time. You just had a million different data sets. And so what these people did then is train a neural network once. And then like it will train for a few hours, of course, but then you can just sequentially feed in all these 1 million data sets.

And for each of these 1 million data sets, it takes way, way less than one second to generate tens of thousands of posterior samples. But that didn't really answer your question. So your question was about how can we generalize in the model space? And that's a really hard problem because essentially what these neural networks learn is to give you some posterior function if you feed in a data set.

Now, if you have a domain shift in the model space, so now you want inference based on a different model, and this neural network has never learned to do that. So that's tough. That's a hard problem. And essentially what you could do and what we are currently doing in our research, but that's cutting edge, is expanding the model space. So you would have a very general formulation of a model and then try to amortize over this model.

So that different configurations of this model, different variations, could just be extracted as special cases of this model, essentially. Can you give an example maybe, to give an idea to listeners how that would work? Absolutely. We have one preprint about sensitivity-aware amortized Bayesian inference. What we do there is essentially have a kind of multiverse analysis built into the neural network training.

To give some background, multiverse analysis basically says, okay, what are all the pre-processing steps that you could take in your analysis? And you encode those. And now you're interested in like, what if, what if I had chosen a different pre-processing technique? What if I had chosen a different way to standardize my data? Then also the classical prior sensitivity or likelihood sensitivity analysis. Like, what happens if I do power scaling on my prior, or power scaling on my likelihood?

So we also encode this. What happens if I bootstrap some of my data or just have a perturbation of my data? What if I add a bit of noise to my data? So these are all slightly different models. What we do is essentially keep track of that during the training phase and just encode it into a vector and say, well, okay, now we're doing pre-processing choice number seven,

and scale the prior to the power of two, don't scale the likelihood and don't do any perturbation, and feed this as additional information into the neural network. Now the cool thing is, during inference phase, once we're done with the training, you can say, hey, here's a data set. Now pretend that we chose pre-processing technique number 11 and prior scaling of power 0.5. What's the posterior now?

Because we've amortized over this large or more general model space, we also get valid posterior inference if we've trained for long enough over these different configurations of the model. And essentially, if you were to do this with MCMC, for instance, you would refit your model every single time. And so here you don't have to do that. Okay. Yeah, I see. That's super... Yeah, that's super cool.
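To make the "what if" mechanism concrete, here is a minimal sketch of how an analysis configuration could be encoded as an extra condition vector alongside the data. The encoding (a one-hot pre-processing choice plus log power-scaling factors and a perturbation flag) is an illustrative assumption, not the exact setup of the preprint or the BayesFlow API.

```python
import numpy as np

def encode_config(preprocessing_id, prior_power, likelihood_power, perturb, n_preproc=12):
    """Encode one analysis configuration as a fixed-length vector of conditions."""
    one_hot = np.zeros(n_preproc)
    one_hot[preprocessing_id] = 1.0
    return np.concatenate(
        [one_hot, [np.log(prior_power), np.log(likelihood_power), float(perturb)]]
    )

# During training: sample random configurations and append them to the network's conditions.
# During inference: ask "what if?" by swapping the configuration, without any model refit.
config_train = encode_config(preprocessing_id=7, prior_power=2.0, likelihood_power=1.0, perturb=False)
config_whatif = encode_config(preprocessing_id=11, prior_power=0.5, likelihood_power=1.0, perturb=False)
```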

And I feel like, so that would be mainly... the main use cases would be, as you were saying, when you're getting into really high data territory and what's changing is mainly the data side, mainly the data set. And to be even more precise, not really the data set, but the data values, because the data set is supposed to be like quite the same, like you would have the same columns, for instance, but the values of the columns would change all the time.

And the model at the same time doesn't change. Is that like, that's really for now, at least, the best use case for that kind of method? Yes. And this might seem like a very niche case. But then if you look at like, Bayesian workflows in practice, this scheme of many model refits doesn't necessarily mean that you have a large number of data sets. This might also just mean you want extensive cross validation. So assume that you have one data set with 1000 observations.

Now you want to run leave-one-out cross validation, but for some reason you can't do the Pareto-smoothed importance sampling version, which would be much faster. So you would need 1000 model refits, even though you just have one data set, because you want 1000 cross validation refits. Maybe can you make explicit what you mean by cross validation here? Because that's not a term that's used a lot in the Bayesian framework, I think. Yeah, of course.

So especially in the Bayesian setting, there's this approach of leave-one-out cross validation, where you would fit your posterior based on all data points but one. And that's why it's called leave-one-out, because you take one out and then fit your model, fit your posterior on the rest of the data. And now you're interested in the posterior predictive performance of this one left-out observation. Yeah. And that's called cross validation. Yeah. Go ahead.

Yeah, no, just, I'm going to let you finish, but yeah, for listeners familiar with the frequentist framework, that's something that's really heavily used in that framework, cross validation. And it's very similar to the machine learning concept of cross validation.

But in the machine learning area, you would rather have something like fivefold or, in general, k-fold cross validation, where you would have larger splits of your data and then use parts of your whole dataset as the training dataset and the rest for evaluation. Essentially, leave-one-out cross validation just puts it to the extreme. Everything but one data point is your train dataset. Yeah. Yeah. Okay. Yeah. Damn, that's super fun.
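To illustrate why exact leave-one-out is so costly without amortization, here is a small NumPy sketch of the refit schedule it implies; the data are a toy stand-in, and with an amortized sampler each "refit" would collapse into a single forward pass.

```python
import numpy as np

y = np.random.default_rng(0).normal(size=1000)  # one dataset, 1000 observations

loo_splits = []
for i in range(len(y)):
    y_train = np.delete(y, i)  # fit the posterior on all observations but one
    y_test = y[i]              # score the posterior predictive on the held-out point
    loo_splits.append((y_train, y_test))

# 1000 observations -> 1000 refits, even though there is only one dataset.
print(len(loo_splits))
```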

And is there, is there already a way for people to try that out, or is it mainly for now implemented for papers? And you are probably, I'm guessing, working on that with Aki and all his group in Finland to make that more open source, helping people use packages to do that. What's the state of the things here? Yeah, that's a great question. And in fact, the state of usable open source software is far behind what we have for likelihood-based, MCMC-based inference.

So we currently don't have something that's comparable to PyMC or Stan. Our group is developing, or actively developing, a software that's called BayesFlow. That's because of the name, because of Bayes, because we're doing Bayesian inference. And essentially the first neural network architecture that was used for this amortized Bayesian inference are so-called normalizing flows. Conditional normalizing flows, to be precise. And that's why the name BayesFlow came to be. But now we

actually have a bit of a different take, because now we have a whole lot of generative neural networks and not only normalizing flows. So now we can also use, for example, score-based diffusion models that are mainly used for image generation and AI, or consistency models, which are essentially like a distilled version of score-based diffusion models. And so now the name BayesFlow doesn't really capture that anymore.

But now what the BayesFlow Python library specializes in is defining principled amortized Bayesian workflows. So the meaning of BayesFlow has slightly shifted to amortized Bayesian workflows, and hence the name lives on. And the focus of BayesFlow and the aim of BayesFlow is twofold. So first, we want a library that's good for actual users. So this might be researchers who just say, hey, here's my data set, here's my model, my simulation program, and please just give me fast posterior samples.

So we want a usable high-level interface with sensible default values that mostly work out of the box, and an interface that's mostly self-explanatory. Also, of course, good teaching material and all this. But that's only one side of the coin, because the other large goal of BayesFlow is that it should be usable for machine learning researchers who want to advance amortized Bayesian inference methods as well. And so the software in general is structured in a very modular way.

So for instance, you could just say, hey, take my current pipeline, my current workflow, but now try out a different loss function because I have a new fancy idea. I want to incorporate more likelihood information, and so I want to alter my loss function. So you would have your general program, and because of the modular architecture there, you could just say, take the current loss function and replace it with a different one that adheres to the same API.

And we're trying to do both and serve both interests: the user-friendly side for actually applied researchers who are also currently using BayesFlow, but then also the machine learning researchers with completely different requirements for this piece of software. Maybe we can also put the BayesFlow documentation and the current project website in the notes. Yeah, we should definitely do that. Definitely gonna try that out myself. It sounds like fun.

I need a use case, but as soon as I have a use case, I'm definitely gonna try that out because it sounds like a lot of fun. Yeah, several questions based on that, and thanks a lot for being so clear and so detailed on these. So first, we talked about normalizing flows in episode 98 with Marylou Gabrié. Definitely recommend listeners to listen to that for some background. And question, so BayesFlow, yeah, definitely we need that in the show notes and I'm going to install that in my environment.

And I'm guessing, so you're saying that that's in Python, right? The package? Yes, the core package is in Python and we're currently refactoring to Keras. So by the time this podcast episode is aired, we will have a new major release version, hopefully. OK, nice. So you're agnostic to the actual machine learning back end.

So then you could choose TensorFlow, PyTorch, or JAX, whatever integrates best with what you're currently proficient in and what you might be currently using in other parts of a project. OK, that was going to be my question. Because I think while preparing for the episode, I saw that you were mainly using PyTorch. So that was going to be my question. What is that based on? So the back end could be PyTorch, JAX, or... what did you say the last one was? TensorFlow.

Yeah, I always forget about all these names. I really know PyTorch. So that's why I like the other ones. And JAX, of course, for PyMC. And then, so my question is, the workflow, what would it look like if you're using BayesFlow? Because you were saying the model, you could write it in standard PyMC or TensorFlow, for instance. Although I don't know if you can write Bayesian models with TensorFlow anymore. Anyways, let's say PyMC or Stan. You write your model.

But then the sampling of the model is done with the neural network. So that means, for instance, PyTorch or JAX. How does that work? Do you have then to write the model in a JAX-compatible way? Or is the translation done by the package itself? Yeah, that's a great question. It touches on many different topics and considerations and also on the future roadmap for BayesFlow.

So this class of algorithms that are implemented in BayesFlow, these amortized Bayesian inference algorithms, to give you some background there, they originally started in simulation-based inference. It's also sometimes called likelihood-free inference. So essentially it is Bayesian inference when you don't bring a closed-form likelihood function to the table. But instead, you only have some generic forward simulation program. So you would just have your prior as some...

Python function or C++ function, whatever, any function that you could call and it would return you a sample from the prior distribution. You don't need to write it down in terms of distributions actually, but you only need to be able to sample from it. And then the same for the likelihood. So you don't need to write down your likelihood in, like, PyMC or Stan in terms of a probability distribution, in terms of distributions or densities. But instead, it's

just got to be some simulation program, which takes in parameters and then outputs data. What happens between these parameters and the data is not necessarily probabilistic in terms of closed-form distributions. It could also be some intractable differential equations. It could be essentially anything.

So for BayesFlow, this means that you don't have to input something like a PyMC or a Stan model, which you write down in terms of distributions, but it's just a generic forward model that you can call and you will get a tuple of a parameter draw and a data set. So you'd usually just do it in NumPy. So you would write, if I'm using BayesFlow, I would write it in NumPy. It would probably be the easiest way.
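To make this concrete, here is a minimal NumPy sketch of the kind of forward simulator such a workflow consumes: any callable that returns a (parameters, data) pair. The Gaussian toy model and function names are illustrative assumptions, not the actual BayesFlow interface.

```python
import numpy as np

rng = np.random.default_rng(42)

def prior():
    # Draw the model parameters, e.g. a mean and a log standard deviation.
    return np.array([rng.normal(0.0, 1.0), rng.normal(0.0, 0.5)])

def simulator(theta, n_obs=100):
    # Forward simulation: anything goes here (ODE solvers, agent-based models, ...),
    # as long as parameters go in and synthetic data come out.
    mu, log_sigma = theta
    return rng.normal(mu, np.exp(log_sigma), size=n_obs)

def simulate_one():
    theta = prior()
    return theta, simulator(theta)

# Training data for the neural network: many (parameters, data) pairs.
train_pairs = [simulate_one() for _ in range(10_000)]
```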

You could probably also write it in JAX or in PyTorch or in TensorFlow or TensorFlow Probability, whatever you want to use, like, behind the scenes. But essentially what we just care about is that we get a tuple of parameters and then data that has been generated from these parameters, for the neural network training process. That's super fun. Yeah, yeah, yeah. Definitely want to see that. Do you have already some Jupyter notebook examples up on the repo or are you working on that?

Yeah, currently it's a full-fledged library. It's been under development for a few years now. And we also have an active user base right now. It's quite small compared to other Bayesian packages. We're growing it. Yeah, that's cool. In the documentation, there are currently, I think, seven or eight tutorial notebooks. And then also for Bayes on the Beach, like this conference in Australia that we just talked about earlier, we also prepared a workshop.

And we're also going to link to this Jupyter notebook in the show notes. Yeah, definitely we should, we should link to some of these Jupyter notebooks in the show notes. And so, I'm thinking you should... Like if you're down, you should definitely come back to the show, but for a webinar. I have another format, the modeling webinar, where you could, you would come to the show and share your screen and go through the model code live, and people can ask questions and so on.

I've done that already on a variety of things. The last one was about causal inference and propensity scores. The next one is going to be about the Hilbert space GP decomposition. So yeah, if you're down, you should definitely come and do a demonstration of BayesFlow and amortized Bayesian inference. I think that would be super fun and very interesting to people. Absolutely. Then to answer the last part of your question.

Yeah. Like if you currently have a model that's written down in PyMC or Stan, that's a bit more tricky to integrate, because essentially all we need in BayesFlow are samples from the prior predictive distribution, if you talk in Bayesian terminology. Yeah. And if your current model can do that, that's fine. That's all you need right now. And then BayesFlow builds...

You can have like a PyMC model and just do pm.sample_prior_predictive, save that as a big NumPy multidimensional array and pass that to BayesFlow. Yes. Okay. All you need are tuples of the ground truth parameters and the data generated from them. So essentially like the result of your prior call and then the result of your likelihood call with those prior parameters. So you mean what the likelihood samples look like once you fix the prior parameters to some value?

Yes. So like in practice, you would just call your prior function. Yeah. Then get a sample from the prior. So parameter vector. Yeah. And then plug this parameter vector into the likelihood function. And then you get one simulated synthetic data set. And you just need those two. Okay. Super cool. Yeah. Definitely sounds like a lot of fun and should definitely do a webinar about that. I'm very excited about that. Yeah. Fantastic. And so that was one of my main questions on that.
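For readers who want to try exactly that route, here is a hedged PyMC sketch of exporting prior parameter draws and the matching prior predictive datasets as plain NumPy arrays. The little regression model is only an example, and the reshaping assumes the default InferenceData layout with one chain.

```python
import numpy as np
import pymc as pm

x = np.linspace(0, 1, 50)

with pm.Model():
    alpha = pm.Normal("alpha", 0, 1)
    beta = pm.Normal("beta", 0, 1)
    sigma = pm.HalfNormal("sigma", 1)
    # Observed values are placeholders; only their shape matters for prior simulation.
    pm.Normal("y", alpha + beta * x, sigma, observed=np.zeros_like(x))

    idata = pm.sample_prior_predictive(5_000, random_seed=1)

# Ground-truth parameter draws and the simulated datasets they generated.
theta = np.stack(
    [idata.prior[name].values.reshape(5_000) for name in ("alpha", "beta", "sigma")],
    axis=-1,
)                                                              # shape (5000, 3)
y_sim = idata.prior_predictive["y"].values.reshape(5_000, -1)  # shape (5000, 50)
```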

Other question is, I'm guessing there are a lot of people working on that, right? Because your roadmap that you just talked about is super big. Because having a package that's designed for users, but also for researchers, is quite... that's really a lot of work. So I'm hoping you're not alone doing that. No, we're currently a team of about a dozen people. No, yeah, that makes sense. It's an interdisciplinary team.

So like a few people with a hardcore software engineering background, some people with a machine learning background, and some people from the cognitive sciences, and also a handful of physicists. Because in fact, these amortized Bayesian inference methods are particularly interesting for physicists. For example, for astrophysicists who have these gravitational wave inference problems where they have massive data sets. And running MCMC on those would be quite cumbersome.

So if you have this huge stream of data and you don't have this underlying likelihood density, but just some simulation program that might generate sensible gravitational waves, then amortized Bayesian inference really shines there. Okay. So that's exactly the case you were talking about, where the model doesn't change, but you have a lot of different datasets. Yeah, exactly. Because I mean, what you're trying to run inference on is your physical model. And that doesn't change.

I mean, it does. And then again, physicists have a very good understanding and very good models of the world around them. And that makes one of the largest differences with people from the cognitive sciences, where, you know, the models of the human brain, for instance, are just... it's such a tough thing to model, and there's so much unknown and so much uncertainty in the model building process. Yeah, for sure. Okay, yeah, I think I'm starting to understand the idea.

And yeah, so actually, episode 101 was exactly about that. Black holes, collisions, gravitational waves. And I was talking with LIGO researchers, Christopher Berry and John Veitch. And we talked exactly about that, their problem with big data sets. They are mainly using sequential Monte Carlo, but I'm guessing they would also be interested in amortized Bayesian inference. So yeah, Christopher and John, if you're listening, feel free to reach out to Marvin and use BayesFlow.

And listeners, this episode will be in the show notes also if you want to give it a listen. That's a really fun one also, learning a lot of stuff about the crazy universe we live in. Actually, a weird question I have is, why is it called amortized Bayesian inference? The reason is that we have this two-stage process where we would first pay upfront with this long neural network training phase.

But then once we're done with this, this cost of the upfront training phase amortizes over all the posterior samples that we can draw within a few milliseconds. That makes sense. That makes sense. And so I think something you're also working on is something that's called deep fusion. And you do that in particular for multimodal simulation-based inference. How is that related to amortized Bayesian inference, if at all? And what is it about? I'm gonna answer these two questions in reverse order.

So first, about the relation between simulation-based inference and amortized Bayesian inference. So to give you a bit of history there, simulation-based inference is essentially Bayesian inference based on simulations, where we don't assume that we have access to a likelihood density, but instead we just assume that we can sample from the likelihood. Essentially simulate from the model. In fact, the likelihood is still

present, but it's only implicitly defined and we don't have access to the density. That's why likelihood-free inference doesn't really fit what's happening here. But instead, like in the recent years, people have started adopting the term simulation-based inference, because we do Bayesian inference based on simulations instead of likelihood densities. So methods that have been used for quite a long time now in the simulation-based inference research area.

For example, rejection ABC, so approximate Bayesian computation, or then ABC-SMC, so combining ABC with sequential Monte Carlo. Essentially, the next iteration there was throwing neural networks at simulation-based inference. That's exactly this neural posterior estimation that I talked about earlier.

And now what researchers noticed is, hey, when we train a neural network for simulation-based inference, instead of running rejection approximate Bayesian computation, then we get amortization for free as a side product. It's just a by-product of using a neural network for simulation-based inference. And so in the last maybe four to five years, people have mainly focused on this algorithm that's called neural posterior estimation for simulation-based inference.

And so all developments that happened there and all the research that happened there, almost all the research, sorry, focused on cases where we don't have any likelihood density. So we're purely in the simulation-based case. Now with our view of things, when we come from a Bayesian inference, likelihood-based setting, we can say, hey, amortization is not just a random coincidental byproduct, but it's a feature and we should focus on this feature.

And so now what we're currently doing is moving this idea of amortized Bayesian inference with neural networks back into a likelihood-based setting. So we've started using likelihood information again. For example, using likelihood densities if they're available or learning information about the likelihood. So like a surrogate model on the fly, and then again, using this information for better posterior inference.

So we're essentially bridging simulation-based inference and likelihood-based Bayesian inference again, with this goal, a larger goal, of amortization if we can do it. And so this work on deep fusion essentially addresses one huge shortcoming of neural networks when we want to use them for amortized Bayesian inference. And that is in situations where we have multiple different sources of data.

So for example, imagine you're a cognitive scientist and you run an experiment with subjects, and for each test subject, you give them a decision-making task. But at the same time, while your subjects solve the decision-making task, you wire them up with an EEG to measure the brain activity. So for each subject, across maybe 100 trials, what you now have is both an EEG and the data from the decision-making task.

Now, if you want to analyze this with PyMC or Stan, what you would just do is say, hey, well, we have two data-generating processes that are governed by a set of shared parameters. So the first part of the likelihood would just be this Wiener process for the decision-making task where you just model the reaction time, a fairly standard procedure there in the cognitive sciences.

And then for the second part, we have a second part of the likelihood that we evaluate, that somehow handles these EEG measurements. For example, a spatio-temporal process or just like some summary statistics that are being computed there, however you would usually compute your EEG. Then you add both to the log PDF of the likelihood, and then you can call it a day.
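As a quick illustration of that "two likelihood terms, shared parameters" formulation, before the contrast with the neural network case below, here is a hedged PyMC sketch; the distributions are simple stand-ins for a real drift-diffusion plus EEG model.

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(1)
rt_data = rng.lognormal(mean=-0.3, sigma=0.3, size=100)  # reaction times per trial
eeg_summary = rng.normal(0.0, 1.0, size=100)             # one EEG summary per trial

with pm.Model():
    drift = pm.Normal("drift", 0, 1)  # shared cognitive parameter
    ndt = pm.HalfNormal("ndt", 0.3)   # non-decision time

    # Likelihood part 1: behavioural data from the decision-making task.
    pm.LogNormal("rt", mu=ndt + drift, sigma=0.3, observed=rt_data)
    # Likelihood part 2: EEG summaries governed by the same latent drift.
    pm.Normal("eeg", mu=2.0 * drift, sigma=1.0, observed=eeg_summary)
    # Both terms simply add to the joint log-density; MCMC takes it from there.
```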

You cannot do that in neural networks, because you have no straightforward sensible way to combine these reaction times from the decision-making task and the EEG data. Because you cannot just take them and slap them together. They are not compatible with each other, because these information data sources are heterogeneous. So you somehow need a way to fuse these sources of information, so that you can then feed them into the neural network.

That's essentially what we're studying in this paper, where you could just get very creative and have different schemes to fuse the data. So you could use these attention schemes that are very hip in large language models right now, with transformers essentially, and have these different data sources attend, or listen essentially, to each other. With cross attention, you could just let the EEG data inform your decision-making data or just have the decision-making data inform the EEG data.

So you can get very creative there. You could also just learn some representation of both individually, then concatenate them and feed them to the neural network. Or you could do very creative and weird mixes of all those approaches. And in this paper, we essentially have a systematic investigation of these different options. And we find that the most straightforward option works the best.

overall, and that's just learning fixed size embeddings of your data sources individually, and then just concatenating them. It turns out then we can use information from both sources in an efficient way, even though we're doing inference with neural networks. And maybe what's interesting for practitioners is that we can compensate for missing data in individual sources.

In the paper, we essentially induced missing data by just taking these EEG data and decision-making data and just randomly dropping some of them. And the neural networks have learned, like, when we do this fusion process, the neural networks learn to compensate for partial missingness in both sources. So if you just remove some of the decision-making data, the neural networks learn to use the EEG data to inform your posterior.
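To sketch the winning scheme, a fixed-size embedding per source followed by concatenation, here is a minimal NumPy illustration. The hand-crafted "embeddings" stand in for learned summary networks, so everything here is an assumption for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
reaction_times = rng.lognormal(mean=-0.3, sigma=0.3, size=100)  # decision-making task
eeg = rng.normal(size=(100, 64))                                # 100 trials x 64 channels

def embed(source, dim=16):
    # Stand-in for a learned summary network: any map to a fixed-length vector.
    quantiles = np.percentile(source, np.linspace(5, 95, dim - 2))
    return np.concatenate([[source.mean(), source.std()], quantiles])

# One fixed-length condition vector for the posterior network.
fused_condition = np.concatenate([embed(reaction_times), embed(eeg)])
print(fused_condition.shape)  # (32,)

# A missing source can be represented by a zeroed embedding plus a missingness flag,
# provided the network also saw such random dropouts during training.
```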

Even though the data in one of the sources are missing, the inference is pretty robust then. And again, all this happens without model refits. So you would just account for that during training. Of course, you have to do this random dropping of data during the training phase as well. And then you can also get it during the inference phase. Yeah, that sounds... yeah, that's really cool. Maybe that's a bit of a... like, this paper is a small piece in our larger roadmap.

This is essentially taking this amortized Bayesian inference up to the level of trustworthiness and robustness and all these gold standards that we currently have for likelihood-based inference in PyMC or Stan. Yeah. Yeah. And there's still a lot of work to do, because of course, like, there's no free lunch, and of course there are many problems with trustworthiness. And that's also one of the reasons why I'm here with Aki right now.

Because Aki is so great at Bayesian workflow and trustworthiness, good diagnostics. That's all, you know, all the things that we currently still need for trustworthy amortized Bayesian inference. Yeah. So maybe you want to talk a bit more about that and what you're doing on that. That sounds like something very interesting. So one huge advantage of an amortized Bayesian sampler is that evaluations and diagnostics are extremely cheap.

So for example, there's this gold standard method that's called simulation-based calibration, where you would sample from your model, and then, like, a sample from your prior predictive space, and then refit your model and look at your coverage, for instance. In general, look at the calibration of your model on this potentially very large prior predictive space. So you naturally need many model refits, but your model is fixed.

So if you do it with MCMC, it's a gold standard evaluation technique, but it's very expensive to run, especially if your model is complex. Now, if you have an amortized estimator, simulation-based calibration on thousands of datasets takes a few seconds.

So essentially, and that's my goal for this research visit with Aki here in Finland, is trying to figure out what are some diagnostics that are gold standard, but potentially very expensive, up to a point where it's infeasible to run on a larger scale with MCMC. But we can easily do it with an amortized estimator. With the goal of figuring out, like, can we trust this estimator? Yes or no?
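Here is a minimal NumPy sketch of that simulation-based calibration loop, with an analytic conjugate posterior standing in for the trained network; with a real amortized sampler, the `amortized_sample` call would be one cheap forward pass, so running thousands of these checks stays fast.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate():
    theta = rng.normal(0, 1)           # prior draw
    y = rng.normal(theta, 1, size=20)  # dataset simulated from that draw
    return theta, y

def amortized_sample(y, num_draws=100):
    # Stand-in for the trained network: the exact conjugate posterior of the toy model.
    n = len(y)
    post_mean = y.sum() / (n + 1)
    post_sd = np.sqrt(1.0 / (n + 1))
    return rng.normal(post_mean, post_sd, size=num_draws)

ranks = []
for _ in range(2_000):  # thousands of refits: cheap once inference is amortized
    theta_true, y = simulate()
    draws = amortized_sample(y)
    ranks.append((draws < theta_true).sum())  # rank of the true value among the draws

# Under correct calibration, the ranks are (approximately) uniform on 0..100.
print(np.histogram(ranks, bins=10)[0])
```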

It's like, as you might know from neural networks, we just have no idea what's happening inside the neural network. And so we currently don't have these strong diagnostics that we have for MCMC. Like, for example, R-hat. There's no comparable thing for neural networks. So one of my goals here is to come up with more good diagnostics that are either possible with MCMC, but very expensive so we don't run them, but they would be very cheap with an amortized estimator.

Or the second thing, just specific to an amortized estimator, just like R-hat is specific to MCMC. Okay. Yeah, I see. Yeah, that makes tons of sense as well. And actually, so I would have more technical questions on these, but I see the time running out. I think something I'm mainly curious about is the challenges, the biggest challenges you face when applying amortized Bayesian inference and deep fusion techniques in your projects, but also, like, in the projects you see.

I think that's going to also give a sense to listeners of when and where to use these kinds of methods. That's a great question. And I'm more than happy to talk about all these challenges that we have, because there's so much room for improvement, because, like, these amortized methods, they have so much potential, but we still have a long way to go until they are as usable and as straightforward to use as current MCMC samplers.

And in general, one challenge for practitioners is that we have most of the problems and hardships that we have in PyMC or Stan. And that is that researchers have to think about their model in a probabilistic way, in a mechanistic way. So instead of just saying, hey, I click on t-test or linear regression in some graphical user interface, they actually have to come up with a data generating process and have to specify their model.

And this whole topic of model specification is just the same in an amortized workflow, because some way we need to specify the Bayesian model. And now on top of all this, we have a huge additional layer of complexity, and this is defining the neural networks. In amortized Bayesian inference, nowadays we have two neural networks. The first one is a so-called summary network, which essentially learns a latent embedding of the data set.

Essentially those are like optimal learned summary statistics, and optimal doesn't mean that they have to be optimal to reconstruct the data, but instead optimal means they're optimal to inform the posterior. For example, in a very, very simple toy model, if you have just like a Gaussian model and you just want to perform inference on the mean, then a sufficient summary statistic for posterior inference on the mean would be the mean. Because that's all you need to reconstruct the mean.

It sounds very tautological, but yeah. Then again, the mean is obviously not enough to reconstruct the data, because all the variance information is missing. What the summary network learns is something like the mean. So summary statistics that are optimal for posterior inference. And then the second network is the actual generative neural network. So like a normalizing flow, score-based diffusion model, consistency model, flow matching, whatever conditional generative model you want.

And this will handle the sampling from the posterior. And these two networks are learned end to end. So you would learn your summary statistic, output it, feed it into the posterior network, the generative model, and then have one evaluation of the loss function, optimize both end to end. And so we have two neural networks, long story short, which is substantially harder than just hitting, like, sample on a PyMC or Stan program. And that's an additional hardship for practitioners.
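To make the two-network picture concrete, here is a hedged PyTorch sketch of a summary network and a deliberately simplified posterior network trained end to end on simulated (parameters, data) pairs. Real amortized samplers use normalizing flows or diffusion models as the generative part, so the conditional Gaussian head below is only an illustrative stand-in, not the BayesFlow architecture.

```python
import torch
import torch.nn as nn

class DeepSetSummary(nn.Module):
    # Respects exchangeability: embed each observation, then mean-pool over the set.
    def __init__(self, summary_dim=8):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, summary_dim))

    def forward(self, y):               # y: (batch, n_obs, 1)
        return self.phi(y).mean(dim=1)  # (batch, summary_dim)

class GaussianPosteriorHead(nn.Module):
    # Maps summary statistics to the mean and log-std of q(theta | data).
    def __init__(self, summary_dim=8, theta_dim=1):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(summary_dim, 32), nn.ReLU(), nn.Linear(32, 2 * theta_dim))

    def forward(self, s):
        mu, log_sigma = self.net(s).chunk(2, dim=-1)
        return mu, log_sigma

summary_net, posterior_net = DeepSetSummary(), GaussianPosteriorHead()
params = list(summary_net.parameters()) + list(posterior_net.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

for step in range(1_000):
    theta = torch.randn(128, 1)                       # prior draws
    y = theta.unsqueeze(1) + torch.randn(128, 50, 1)  # simulated datasets, 50 obs each
    mu, log_sigma = posterior_net(summary_net(y))
    # Maximum-likelihood loss (negative log q), optimized end to end through both nets.
    loss = (log_sigma + 0.5 * ((theta - mu) / log_sigma.exp()) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```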

Now in BayesFlow, what we do is we provide sensible default values for the generative neural networks, which work in maybe like 80 or 90% of the cases.

It's just sufficient to have, for example, like a neural spline flow, like some sort of normalizing flow with, I don't know, like, six layers and a certain number of units, some regularization for robustness and, you know, cosine decay of the learning rate, and all these machine learning parts, we try to take them away from the user if they don't want to mess with it.

But still, if things don't work, they would need to somehow diagnose the problems and then, you know, play with the number of layers and the neural network architecture. And then for the summary network, the summary network essentially needs to be informed by the data. So if you have time series, you would look at something like an LSTM, so these like long short-term memory time series neural networks. Or you would have like a recurrent neural network or nowadays a time series transformer.

They're also called temporal fusion transformers. If you have IID data, you would have something like a deep set or a set transformer, which respect this exchangeable structure of the data. So again, we can give all the recommendations and sensible default values, like if you have a time series, try a time series transformer. Then again, if things don't work out, users need to play around with these settings. So that's definitely one hardship of amortized Bayesian inference in general.

And for the second part of your question, hardships of this deep fusion: it's essentially if you have more and more information sources, then things can get very complicated. For example, just a few days ago, we discussed a case where someone has 60 different sources of information and they're all streams of time series. Now we could say, hey, just slap 60 summary networks on this problem, like one summary network for each domain.

That's going to be very complex and very hard to train, especially if we don't bring that many data sets to the table for the neural network training. And so there we somehow need to find a compromise. Okay, what information can we condense and group together? So maybe some of the time series sources are somewhat similar and actually compatible with each other. So we could, for example, come up with six groups of 10 time series each.

Then we would only need six neural networks for the summary embeddings, and all these practical considerations. That makes things just like as hard as in likelihood-based, MCMC-based inference, but just a bit harder, because of all the neural network stuff that's happening. Did this address your question? Yeah. Yeah. It gives me more questions, but yeah, for sure. That does answer the question.

When you're talking about transformers for time series, are you talking about the transformers, the neural network that's used in large language models, or is it something else? It's essentially the same, but slightly adjusted for time series, so that the summary statistics or these latent embeddings that you output still respect the time series structure, where typically you would have this autoregressive structure.

So it's not exactly the same as a standard transformer, but you would just enrich it to respect the probabilistic structure in your data. But at the core, it's just the same. So at the core, it's an attention mechanism, like multi-head attention, where, like, the different parts of your dataset could essentially talk or listen to each other. So it's just the same. Okay. Yeah, that's interesting. I didn't know that existed for time series. That's interesting.

That means, so because with the transformer, like, one of the main things is you have to tokenize the inputs, right? So here, would there be a tokenization happening of the time series data? You don't have to tokenize here, because the reason why you have to tokenize in large language models or natural language processing in general is that you want to somehow encode your characters or your words

into, like, numbers essentially, and we don't need that in Bayesian inference in general, because we already have numbers. Yeah. So our data already comes in numbers, so we don't need tokenization here. Of course, if we had text data, then we would need tokenization. Yeah. Yeah. Yeah. OK. OK. Yeah, it makes more sense to me. All right, that's fun. I didn't know that existed. Do you have any resources about transformers for time series that we could put in the show notes? Absolutely.

There is a paper that's called Temporal Fusion Transformers, I think. I will send you the link. Yeah. Awesome. Yeah, thanks. Definitely. We have this time series transformer, temporal fusion transformer, implemented in BayesFlow. So now it's just like a very usable interface where you would just input your data and then you get your latent embeddings. You can say, like, I want to input my data and I want as an output 20 learned summary statistics. So that's all you need to do there.

Okay. And you can go crazy. So what would you do with it? Good. Yeah, what would you do with these results? Basically the outputs of the transformer, what would you use that for? Those are the learned summary statistics, that you would then treat as a compressed, fixed-length version of your data for the posterior network, for this generative model. So then you use that afterwards in the model? Exactly.

Yeah. So the transformer is just used to learn summary statistics of the data sets that we input. For instance, if you have time series, like we did this for COVID time series. If you have a COVID time series for, like, a three-year period with daily reporting, you would have a time series with about a thousand time steps. That's quite long as a condition to pass into a neural network.

And also, like, if now you don't have a thousand days, but a thousand and one days, then the length of your input to the neural network would change, and your neural network wouldn't like that. So what you do with a time series transformer is compress this time series of maybe 1,000 or maybe 1,050 time steps into a fixed-length vector of summary statistics. Maybe you extract 200 summary statistics from that. Hey, okay, I see.

And then you can use that in your neural network, in the model that's going to be sampling your model. In the neural network that's going to be sampling your model. We already see that we're heavily overloading terminology here. So what's a model actually? So then we have to differentiate between the actual Bayesian model that we're trying to fit. And then the neural network, the generative model or generative neural network that we're using as a replacement for MCMC.

So it's, it's a lot of this taxonomy that's, that's odd when you're at the interface of deep learning and statistics. Another one of those hiccups are parameters. Like, in Bayesian inference, parameters are your inference targets. So you want posterior distributions on a handful of model parameters. When you talk to people from deep learning about parameters, they understand the neural network weights.

So sometimes you have to be careful with the... I have to be careful with the terminology and words used to describe things, because we have different types of people going on different levels of abstraction here in different functions. Yeah. Yeah, exactly. So that means in this case, the transformer takes in time values, it summarizes them, and it passes that on to the neural network that's going to be used to sample the Bayesian model. Exactly.

And they are passed in as the conditions, like conditional probability, which totally makes sense because like this generative neural network, it learns the distribution of parameters conditional on the data or summary statistics of the data. So that's the exact definition of the Bayesian posterior distribution. Like a distribution of the Bayesian model parameters conditional on the data. It's the exact definition of the posterior. Yeah, I see. And that means...

So in this case, yeah, no, I think my question was going to be, so why would you use this kind of additional layer on the time series data? But you kind of answered that. Is that, well, what if your time series data is too big or something like that? Exactly. It's not just about being too big, but also just the variable length. Because the neural network, like the generative neural network, it always wants fixed-length inputs.

Like it can only handle... in this case of the COVID model, it could only handle input conditions with length 200. And now the time series transformer takes care of the part that our actual raw data have variable length. And time series transformers can handle data of variable length. So they would, you know, just take a time series of length maybe 500 time steps to 2,000 time steps, and then always compress it to 200 summary statistics.

So this generative neural network, which is much more strict about the shapes and form of the input data, will always see the same-length inputs. Yeah. Okay. Yeah, I see. That makes sense. Awesome. Yeah, super cool. And so, as you were saying, this is already available in BayesFlow, people can use this kind of transformer for time series. Yeah, absolutely. For time series and also for sets. So for IID data.

Yeah. Because if you just take an IID data set and input it into a neural network, the neural network doesn't know that your observations are exchangeable. So it will assume much more structure than there actually is in your data. So again, it has a double function, like a dual function of compressing the data, encoding the probabilistic structure of the data, and also outputting a fixed-size representation. So this would be a set transformer, or a deep set is another option.
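For the IID case, a deep set is the simplest way to bake that exchangeability in: embed each observation independently, pool with a permutation-invariant operation such as the mean, and project to fixed-size summaries. A minimal sketch, again with made-up names and sizes rather than BayesFlow's API:

```python
import torch
import torch.nn as nn

class DeepSetSummary(nn.Module):
    """Toy deep set: exchangeable observations -> fixed-size, permutation-invariant summary."""

    def __init__(self, n_features=2, d_hidden=64, n_summaries=32):
        super().__init__()
        # phi acts on each observation independently
        self.phi = nn.Sequential(nn.Linear(n_features, d_hidden), nn.ReLU(),
                                 nn.Linear(d_hidden, d_hidden))
        # rho acts on the pooled, order-free representation
        self.rho = nn.Sequential(nn.Linear(d_hidden, d_hidden), nn.ReLU(),
                                 nn.Linear(d_hidden, n_summaries))

    def forward(self, x):
        # x: (batch, n_observations, n_features); order of observations must not matter
        pooled = self.phi(x).mean(dim=1)   # mean pooling is permutation-invariant
        return self.rho(pooled)

net = DeepSetSummary()
data = torch.randn(4, 150, 2)                 # 150 exchangeable observations per data set
shuffled = data[:, torch.randperm(150), :]    # same data, different order
print(torch.allclose(net(data), net(shuffled), atol=1e-5))  # True, up to float error
```

The check at the end is the whole point: shuffling the observations does not change the summaries, which is exactly the probabilistic structure an unconstrained network would have to waste capacity learning.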

It's also implemented in BayesFlow. Super cool. Yeah. And so let's start winding down here, because I've already taken a lot of your time. Maybe a last few questions would be: what are some emerging topics that you see within deep learning and probabilistic machine learning that you find particularly intriguing? Because we've talked here a lot about really the nitty-gritty, the statistical detail and so on, but now let's zoom out a bit and start thinking more long-term.

Yeah. I'm very excited about two large topics. The first one is generative models that are very expressive, so unconstrained neural network architectures, but that at the same time have one-step inference. So, for example, people have been using score-based diffusion models a lot, or flow matching, for image generation, like, for example, Stable Diffusion. You might be familiar with this tool: you input a text prompt and then you get fantastic images.

Now, this takes quite some time, so like a few seconds for each image, and only because it runs on a fancy cluster. If you run it locally on a computer, it takes much longer. And that's because the score-based diffusion model needs many discretization steps in this denoising process during inference time. And now, throughout the last year, there have been a few attempts at having these very expressive and super powerful neural networks.

But they are much, much faster, because they don't have these many denoising steps. Instead, they directly learn a one-step inference. So they could generate an image not in, like, a thousand steps, but in only one step. And that's very cutting edge, or bleeding edge if you will, because they don't work that great yet. But I think there's much potential in there. It's both expressive and fast. And then again, we've used some of those for amortized Bayesian inference.
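To make the contrast concrete, here is a toy sketch of the difference in computation shape: a diffusion-style sampler loops over many small denoising steps, while a consistency-style model learns a single map from noise to sample. The networks below are untrained placeholders with invented sizes; this is only about counting forward passes, not a working diffusion or consistency model.

```python
import torch
import torch.nn as nn

denoiser = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 2))      # takes (x, t)
one_step_net = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 2))  # consistency-style

def diffusion_sample(n_steps=1000):
    """Many small denoising steps: expressive, but n_steps network calls per sample."""
    x = torch.randn(1, 2)
    for i in reversed(range(n_steps)):
        t = torch.full((1, 1), i / n_steps)
        x = x + denoiser(torch.cat([x, t], dim=1)) / n_steps   # crude Euler-style update
    return x

def consistency_sample():
    """One-step inference: a single network call maps noise directly to a sample."""
    z = torch.randn(1, 2)
    t = torch.ones(1, 1)
    return one_step_net(torch.cat([z, t], dim=1))

print(diffusion_sample().shape)    # 1000 forward passes for one sample
print(consistency_sample().shape)  # 1 forward pass for one sample
```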

So we use consistency models, and they have super high potential, in my opinion. So, you know, with these advances in deep learning, oftentimes we can use them for amortized Bayesian inference. We just reformulate these generative models and slightly tune them to our tasks. So I'm very excited about this. And the second area I'm very excited about is foundation models. I guess most people in AI are these days.

So foundation models: essentially, the point is that neural networks are very good at in-distribution tasks. So whatever is in the training data set, neural networks are typically very good at finding patterns that are similar to what they saw in the training set. Now, in the open world, so if we are out of distribution, we have a domain shift, distribution shift, model misspecification, whatever you want to call it, neural networks typically aren't that good.

So what we could do is either make them slightly better out of distribution, or we just extend the in-distribution space to a huge space. And that's what foundation models do. For example, GPT-4 would be a foundation model, because it's just trained on so much data. I don't know how much, it's not terabytes anymore. It's, like, essentially the entire internet. So it's just a huge training set. And so the world, the training set that this neural network has been trained on, is just huge.

And so, essentially, we don't really have out-of-distribution cases anymore, just because our training set is so huge. And that's also one area that could be very useful for amortized Bayesian inference, and to overcome the very initial shortcoming that you talked about, where we would also like to amortize over different Bayesian models. Hmm. I see. Yeah, yeah, yeah. Yeah, that would definitely be super fun.

Yeah, I'm really impressed and interested to see this interaction of, like, deep learning, artificial intelligence, and then the Bayesian framework coming on top of that. That is really super cool. I love that. Yeah. Yeah, it makes me super curious to try that stuff out. So to play us out, Marvin, actually, this is a very active area of research. So what advice would you give to beginners interested in diving into this intersection of deep learning and probabilistic machine learning?

That's a great question. Essentially, I would have two recommendations. The first one is to really try to simulate stuff. Whatever it is that you are curious about, just try to write a simulation program and try to simulate some of the data that you might be interested in. So, for example, if you're really interested in soccer, then code up a simulation program that just simulates soccer matches and the outcomes of soccer matches.
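As a concrete illustration of that advice, here is a tiny, made-up soccer simulator: latent team strengths go in, Poisson-distributed goal counts come out. The parameter names and values are invented for the sketch; the point is just to write down a data-generating process you can sample from.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_match(attack_home, attack_away, home_advantage=0.3):
    """Simulate one match: goal counts are Poisson draws whose rates depend on team strength."""
    goals_home = rng.poisson(np.exp(attack_home + home_advantage))
    goals_away = rng.poisson(np.exp(attack_away))
    return goals_home, goals_away

def simulate_season(n_matches=100):
    """Simulate a small 'season': draw team strengths, then match outcomes."""
    results = []
    for _ in range(n_matches):
        attack_home, attack_away = rng.normal(0.2, 0.5, size=2)  # latent team strengths
        results.append(simulate_match(attack_home, attack_away))
    return results

season = simulate_season()
print(season[:5])  # e.g. pairs of (home goals, away goals) for the first five matches
```

Once you can generate data like this, the inverse question, which team strengths could have produced the scores I actually observed, is exactly the kind of posterior inference the rest of the episode is about.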

So you can really get a feeling for the data generating processes that are happening, because probabilistic machine learning at its very core is all about data generating processes and reasoning about these processes. And I think it was Richard Feynman who said, "What I cannot create, I do not understand." That's essentially at the heart of simulation-based inference in a more narrow setting,

and probabilistic machine learning more broadly, or science more broadly even. So yeah, definitely, simulating and running simulation studies can be super helpful, both to understand what's happening in the background and also to get a feeling for programming and to get better at programming as well. Then the second advice would be to essentially find a balance between these hands-on, getting-your-hands-dirty type of things, like implementing a model in

PyTorch or Keras, or solving some Kaggle tasks, just some machine learning tasks. But then, at the same time, also finding this balance with reading books and finding new information, to make sure that you actually know what you're doing, and also know what you don't know and what the next steps are to get better on the theoretical part. And there are two books that I can really recommend. The first one is Deep Learning by Ian Goodfellow. It's also available for free online.

We can also link to this in the show notes. It's a great book and it covers so much. And then, if you come from this Bayesian or statistics background, you see a lot of conditional probabilities in there, because a lot of deep learning is just conditional generative modeling. And then the second book would in fact be Statistical Rethinking by Richard McElreath. It's a great book, and it's not only limited to Bayesian inference, but covers more. Also a lot of causal inference, of course.

Also just thinking about probability and the philosophy behind this whole probabilistic modeling topic more broadly. So, earlier today I had a chat with one of the student assistants that I'm supervising, and he said, "Hey Marvin, I read Statistical Rethinking a few weeks ago. And today I read something about score-based diffusion models," so these, like, state-of-the-art deep learning models that are used to generate images. He said, "Because I read Statistical Rethinking, it all made sense."

"There's so much probability going on in these score-based diffusion models, and Statistical Rethinking really helped me understand that." And at first I didn't really, I couldn't believe it, but it totally makes sense. Because, like, Statistical Rethinking is not just a book about Bayesian workflow and Bayesian modeling, but more about, you know, reasoning about probabilities and uncertainty in a more general way. And it's a beautiful book. So I'd recommend those.

Nice. Yeah. So definitely, let's put those two in the show notes, Marvin. I will. So, of course, I've read Statistical Rethinking several times, so I definitely agree. The first one about deep learning, I haven't yet, but I will definitely read it, because that sounds really fascinating. So I really want to get that book. Fantastic. Well, thanks a lot, Marvin. That was really awesome. I really learned a lot. I'm pretty sure listeners did too, so that's super fun.

You definitely need to come back to do a modeling webinar with us and show us in action what we talked about today with the BayesFlow package. It's also, I guess, going to inspire people to use it and maybe contribute to it. But before that, of course, I'm going to ask you the last two questions I ask every guest at the end of the show. First one, if you had unlimited time and resources, which problem would you try to solve?

That's a very loaded question, because there are so many very, very important problems to solve. Like big-picture problems: peace, world hunger, global warming, all those. I'm afraid I couldn't, like, with my background, I don't really know how to contribute significantly, with a huge impact, to those problems.

So my consideration is essentially a trade-off between, like... how important is the problem, what impact does solving or addressing the problem have, and what impact could I have on solving the problem? And so I think what would be very nice is to make probabilistic inference, or Bayesian inference in particular, accessible, usable, easy, and fast for everyone. And that doesn't just mean, you know, methods researchers, machine learning researchers.

It essentially means anyone who works with data in any way. And there's so much to do. Like, the actual Bayesian model in the background, it could be huge, like a BayesGPT, like ChatGPT but just for Bayes, just with the sheer scope of amortization, different models, different settings, and so on. So that's a huge, huge challenge on the backend side. But then on the frontend and API side, I think it also has many different subproblems.

Because it would mean, like, people could just, you know, write down a description of their model in plain text, like with a large language model, and, you know, not actually specify everything by programming. Maybe also just sketch out some data, like expert elicitation, and all those different topics. I think there's this bigger picture that, you know, thousands of researchers worldwide are working on, in so many niche topics.

But having this overarching BayesGPT kind of thing would be really cool. So I'd probably choose that to work on. It's a very risky thing, so that's why I'm not currently working on it. Yeah, I love that. Yeah, that sounds awesome. Feel free to cooperate and collaborate with me on that. I would definitely be down. That sounds absolutely amazing. Yeah. So send me an email when you start working on that, please. I'll be happy to join the team.

And second question: if you could have dinner with any great scientific mind, dead, alive, or fictional, who would it be? Again, very loaded question. Super interesting question. I mean, there are two huge choices. I could either go with someone who's currently alive, and I feel like I want their take on the current state of the art and future directions and so on. And the second huge option, which I guess many people would go with, is someone who's been dead for two to three centuries.

And I think I'd go with the second choice. So really take someone from way back in the past. And that's for two reasons. I think, like, of course, speaking to today's scientists is super interesting and I would love to do that. But, I mean, they have access to all the state-of-the-art technology and they know about all the latest advancements. And so, if they have some groundbreaking, creative ideas to share that they come up with, they could just implement them and make them actionable.

And the second reason is that today's scientists have a huge platform, because they're on the internet. So if they really want to express an idea, they could just do it on Twitter or wherever. So there are other ways to engage with them apart from, you know, having a magical dinner, right? So I would choose someone from the past, and in particular,

I think Ada Lovelace would be super interesting for me to talk to, essentially because she's widely considered the first programmer. The craziest thing about that is she never had access to, like, a modern computer. So she wrote the first program, but the machine wasn't there yet. That's such a huge leap of creativity and genius.

And so I'd really be interested in, like, if Ada Lovelace saw what's happening today, all the technology that we have with generative AI, GPU clusters, and all these possibilities, what's the next leap forward? Like, what's today's equivalent of writing the first program without having the computer? Yeah, I'd really love to know this answer, and there's currently no other way, except for your magical dinner invitation, to get this answer. So that's why I go with this option.

Yeah. Yeah. No, awesome. Awesome. I love it. That definitely sounds like a, like a marvelous dinner. So yeah. Awesome. Thanks a lot, Marvin. That was, that was really a blast. I'm going to let you go now, because you've been talking for a long time, and I'm guessing you need a break. But that was really amazing. So yeah, thanks a lot for taking the time. Thanks again to Matt Rosinski for this awesome recommendation. I hope you loved it, Marvin, and also Matt. Me, I did. So that was really awesome.

As usual, I'll put resources and a link to your website. And also, Marvin is going to add stuff to the show notes for those who want to dig deeper. Thank you again, Marvin, for taking the time and being on this show. Thank you very much for having me, Alex. I appreciate it. This has been another episode of Learning Bayesian Statistics. Be sure to rate, review and follow the show on your favorite podcatcher and visit

learnbayesstats.com for more resources about today's topics, as well as access to more episodes to help you reach a true Bayesian state of mind. That's learnbayesstats.com. Our theme music is Good Bayesian by Baba Brinkman, featuring MC Lars and Mega Ran. Check out his awesome work at bababrinkman.com. I'm your host, Alex Andorra. You can follow me on Twitter at Alex underscore Andorra, like the country. You can support the show and unlock exclusive benefits by visiting patreon

.com/learnbayesstats. Thank you so much for listening and for your support. You're truly a good Bayesian, change your predictions after taking information in. And if you're thinking I'll be less than amazing, let's adjust those expectations. Let me show you how to be a good Bayesian. Change calculations after taking fresh data in. Those predictions that your brain is making, let's get them on a solid foundation.

Transcript source: Provided by creator in RSS feed.