#104 Automated Gaussian Processes & Sequential Monte Carlo, with Feras Saad


Apr 16, 2024 · 2 hr 31 min · Season 1 · Ep. 104

Episode description

Proudly sponsored by PyMC Labs, the Bayesian Consultancy. Book a call, or get in touch!


GPs are extremely powerful… but hard to handle. One of the bottlenecks is learning the appropriate kernel. What if you could learn the structure of GP kernels automatically? Sounds really cool, but also a bit futuristic, doesn’t it?

Well, think again, because in this episode, Feras Saad will teach us how to do just that! Feras is an Assistant Professor in the Computer Science Department at Carnegie Mellon University. He received his PhD in Computer Science from MIT, and, most importantly for our conversation, he’s the creator of AutoGP.jl, a Julia package for automatic Gaussian process modeling.

Feras discusses the implementation of AutoGP, how it scales, what you can do with it, and how you can integrate its outputs in your models.

Finally, Feras provides an overview of Sequential Monte Carlo and its usefulness in AutoGP, highlighting the ability of SMC to incorporate new data in a streaming fashion and explore multiple modes efficiently.

Our theme music is « Good Bayesian », by Baba Brinkman (feat MC Lars and Mega Ran). Check out his awesome work at https://bababrinkman.com/ !

Thank you to my Patrons for making this episode possible!

Yusuke Saito, Avi Bryant, Ero Carrera, Giuliano Cruz, Tim Gasser, James Wade, Tradd Salvo, William Benton, James Ahloy, Robin Taylor, Chad Scherrer, Zwelithini Tunyiswa, Bertrand Wilden, James Thompson, Stephen Oates, Gian Luca Di Tanna, Jack Wells, Matthew Maldonado, Ian Costley, Ally Salim, Larry Gill, Ian Moran, Paul Oreto, Colin Caprani, Colin Carroll, Nathaniel Burbank, Michael Osthege, Rémi Louf, Clive Edelsten, Henri Wallen, Hugo Botha, Vinh Nguyen, Marcin Elantkowski, Adam C. Smith, Will Kurt, Andrew Moskowitz, Hector Munoz, Marco Gorelli, Simon Kessell, Bradley Rode, Patrick Kelley, Rick Anderson, Casper de Bruin, Philippe Labonde, Michael Hankin, Cameron Smith, Tomáš Frýda, Ryan Wesslen, Andreas Netti, Riley King, Yoshiyuki Hamajima, Sven De Maeyer, Michael DeCrescenzo, Fergal M, Mason Yahr, Naoya Kanai, Steven Rowland, Aubrey Clayton, Jeannine Sue, Omri Har Shemesh, Scott Anthony Robson, Robert Yolken, Or Duek, Pavel Dusek, Paul Cox, Andreas Kröpelin, Raphaël R, Nicolas Rode, Gabriel Stechschulte, Arkady, Kurt TeKolste, Gergely Juhasz, Marcus Nölke, Maggi Mackintosh, Grant Pezzolesi, Avram Aelony, Joshua Meehl, Javier Sabio, Kristian Higgins, Alex Jones, Gregorio Aguilar, Matt Rosinski, Bart Trudeau, Luis Fonseca, Dante Gates, Matt Niccolls, Maksim Kuznecov, Michael Thomas, Luke Gorrie, Cory Kiser, Julio, Edvin Saveljev, Frederick Ayala, Jeffrey Powell and Gal Kampel.

Visit https://www.patreon.com/learnbayesstats to unlock exclusive Bayesian swag ;)

Takeaways:

- AutoGP is a Julia package for automatic Gaussian process modeling that learns the structure of GP kernels automatically from the data.

Transcript

GPs are extremely powerful, but hard to handle. One of the bottlenecks is learning the appropriate kernels. Well, what if you could learn the structure of GP kernels automatically? Sounds really cool, right? But also, eh, a bit futuristic, doesn't it? Well, think again, because in this episode, Feras Saad will teach us how to do just that. Feras is an assistant professor in the computer science department at Carnegie Mellon University. He received his PhD in computer science from MIT.

And most importantly for our conversation, he's the creator of AutoGP.jl, a Julia package for automatic Gaussian process modeling. Feras discusses the implementation of AutoGP, how it scales, what you can do with it, and how you can integrate its outputs in your Bayesian models. Finally, Feras provides an overview of Sequential Monte Carlo and its usefulness in AutoGP, highlighting the ability of SMC to incorporate new data in a streaming fashion and explore multiple modes efficiently.

This is Learning Bayesian Statistics, episode 104, recorded February 23, 2024. Welcome to Learning Bayesian Statistics, a podcast about Bayesian inference, the methods, the projects, and the people who make it possible. I'm your host, Alex Andorra. You can follow me on Twitter at alex_andorra, like the country, for any info about the show. LearnBayesStats.com is la place to be.

Show notes, becoming a corporate sponsor, unlocking Bayesian merch, supporting the show on Patreon, everything is in there. That's learnbayesstats.com. If you're interested in one-on-one mentorship, online courses, or statistical consulting, feel free to reach out and book a call at topmate.io/alex_andorra. See you around, folks, and best Bayesian wishes to you all.

First, I want to thank Edvin Saveljev, Frederick Ayala, Jeffrey Powell, and Gal Kampel for supporting the show on Patreon. Your support is invaluable, guys, and literally makes this show possible. I cannot wait to talk with you in the Slack channel. Second, I have an exciting modeling webinar coming up on April 18 with Juan Orduz, a fellow PyMC core dev and mathematician.

In this modeling webinar, we'll learn how to use the new HSGP approximation for fast and efficient Gaussian processes. We'll simplify the foundational concepts, explain why this technique is so useful and innovative, and, of course, show you a real-world application in PyMC. So if that sounds like fun, go to topmate.io/alex_andorra to secure your seat.

Of course, if you're a patron of the show, you get bonuses like submitting questions in advance, early access to the recording, et cetera. You are my favorite listeners, after all. Okay, back to the show now. Feras Saad, welcome to Learning Bayesian Statistics. Hi, thank you. Thanks for the invitation. I'm delighted to be here. Yeah, thanks a lot for taking the time. Thanks a lot to Colin Carroll, who of course listeners know, he was in episode 3 of Learning Bayesian Statistics.

Well, I will of course put it in the show notes. That's like a vintage episode now, from 4 years ago. I was a complete beginner in Bayesian stats, so if you want to hear me embarrass myself, that's definitely one of the episodes you should listen to, with all my beginner's questions. And that's one of the rare episodes I could do on site: I was with Colin in person to record that episode in Boston. So, hi Colin, thanks a lot again. And Feras, let's talk about you first.

How would you define the work you're doing nowadays? And also, how did you end up doing that? Yeah, yeah, thanks. And yeah, thanks to Colin Carroll for setting up this connection. I've been watching the podcast for a while and I think it's really great how you've brought together lots of different people in the Bayesian inference community, the statistics community, to talk about their work. So thank you, and thank you to Colin for that connection. Yeah, so a little background about me.

I'm a professor at CMU and I'm working in a few different areas surrounding Bayesian inference with my colleagues and students. I like to think of the work I do as following different threads, which are all unified by this idea of probability and computation.

So one area that I work a lot in, and I'm sure you have lots of experience in this, being one of the core developers of PyMC, is probabilistic programming languages and developing new tools that help both high-level users and also machine learning experts and statistics experts more easily use Bayesian models and inference as part of their workflow.

Putting my programming languages hat on, it's important to think about not only how we make it easier for people to write up Bayesian inference workflows, but also what kind of guarantees or what kind of help we can give them in terms of verifying the correctness of their implementations, or automating the process of getting these probabilistic programs to begin with, using probabilistic program synthesis techniques.

So these are questions that are very challenging and, you know, if we're able to solve them, you know, really can go a long way. So there's a lot of work in the probabilistic programming world that I do, and I'm specifically interested in probabilistic programming languages that support programmable inference.

So we can think of many probabilistic programming languages like Stan or BUGS or PyMC as largely having a single inference algorithm that they're going to use for all the different programs you can express. So BUGS might use Gibbs sampling, Stan uses HMC with NUTS, PyMC uses MCMC algorithms, and these are all great. But of course, one of the limitations is there's no universal inference algorithm that works well for any problem you might want to express.

And that's where I think a lot of the power of programmable inference comes in. A lot of where the interesting research is as well, right? Like, how can you support users writing their own, say, MCMC proposal for a given Bayesian inference problem, and verify that that proposal distribution meets the theoretical conditions needed for soundness: whether it defines an irreducible chain, for example, or whether it's aperiodic.

Or, in the context of variational inference, whether you define a variational family that is broad enough, so its support encompasses the support of the target model. We have all of these conditions that we usually hope are correct, but our systems don't actually verify them for us, whether it's MCMC or variational inference or importance sampling or sequential Monte Carlo.

And I think the more flexibility we give programmers, the more important that kind of verification becomes. I touched upon this a little bit by talking about probabilistic program synthesis, which is this idea of automated probabilistic model discovery. And there, our goal is to use hierarchical Bayesian models to specify prior distributions, not only over model parameters, but also over model structures.

And here, this is based on the idea that traditionally in statistics, a data scientist or an expert will hand-design a Bayesian model for a given problem, but oftentimes it's not obvious what's the right model to use. So the idea is, you know, how can we use the observed data to guide our decisions about what is the right model structure to even be using, before we worry about parameter inference? So we've looked at this problem in the context of learning models of time series data.

Should my time series data have a periodic component? Should it have polynomial trends? Should it have a change point? Right? How can we automate the discovery of these different patterns and then learn an appropriate probabilistic model?

And I think it ties in very nicely to probabilistic programming, because probabilistic programs are so expressive that we can express prior distributions on structures, or prior distributions on probabilistic programs, all within the system using this unified technology. Which is where these two research areas really inform one another.

If we're able to express rich probabilistic programming languages, then we can start doing inference over probabilistic programs themselves and try to synthesize these programs from data. Other areas that I've looked at are tabular data or relational data models, different types of traditionally structured data, and synthesizing models there. And the workhorse in that area is largely Bayesian nonparametrics.

So, prior distributions over unbounded spaces of latent variables, which are, I think, a very mathematically elegant way to treat probabilistic structure discovery, using Bayesian inference as the workhorse for that.

And I'll just touch upon a few other areas that I work in, which are also quite aligned. A third area I work in is more on the computational statistics side: now that we have probabilistic programs, and they're becoming more and more routine in the workflow of Bayesian inference, we need to start thinking about new statistical methods and testing methods for these probabilistic programs.

So for example, this is a little bit different from traditional statistics where, traditionally, we might do some type of analytic mathematical derivation on some probability model. So you might write up your model by hand, and then, if you want to compute some property, you'll treat the model as some kind of mathematical expression. But now that we have programs, these programs are often far too hard to formalize mathematically by hand.

So if we want to analyze their properties, how can we understand the properties of a program? By simulating it. A very simple example of this would be: say I wrote a probabilistic program for some given data, and I actually have the data. Then I'd like to know whether the probabilistic program I wrote is even a reasonable prior for that data. So this is goodness-of-fit testing: how well does the probabilistic program I wrote explain the range of data sets I might see?

So, you know, if you do a goodness-of-fit test using Stats 101, you would ask: all right, what is my distribution? What is the CDF? What are the parameters? And you'd derive some type of test by hand. But for probabilistic programs, we can't do that. So we might like to simulate data from the program and do some type of analysis based on samples from the program as compared to samples of the observed data.
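The simulation-based check described here can be sketched in a few lines. This is a hypothetical toy, not AutoGP or Gen code: the "probabilistic program" is just a Gaussian with a latent mean, and the discrepancy is the sample mean, but the pattern (simulate many data sets from the program, compare a statistic of the observed data against the simulated distribution) is the one being described.

```python
import random
import statistics

def program():
    """A toy probabilistic program: Gaussian data with a latent mean."""
    mu = random.gauss(0.0, 5.0)                      # prior over the mean
    return [random.gauss(mu, 1.0) for _ in range(50)]

def discrepancy(dataset):
    """Summary statistic used to compare simulated and observed data."""
    return statistics.mean(dataset)

# Stand-in for real observed data.
observed = [random.gauss(3.0, 1.0) for _ in range(50)]

# Simulate many data sets from the program and ask how often the
# simulated statistic is at least as extreme as the observed one.
sims = [discrepancy(program()) for _ in range(2000)]
obs_stat = discrepancy(observed)
p_value = sum(s >= obs_stat for s in sims) / len(sims)
# A p-value very close to 0 or 1 suggests the program is a poor prior
# for this data; no analytic CDF of the program is ever needed.
```

The point of the sketch is that everything is computed from samples of the program, which is exactly what makes the approach work when the program is too complex to analyze mathematically.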

So, these types of simulation-based analyses of statistical properties of probabilistic programs, for testing their behavior or for quantifying the information between variables, things like that. And then the final area I'll touch upon is really more at the foundational level, which is understanding, in a more rigorous or principled way, the primitive operations on our computers that enable us to do random computations. So what do I mean by that?

Well, you know, we love to assume that our computers can freely compute over real numbers. But of course, computers don't have real numbers built within them. They're built on finite precision machines, right, which means I can't express some arbitrary division between two real numbers. Everything, at some level, is floating point. And so this gives us a gap between the theory and the practice.

Because in theory, whenever we're writing our models, we assume everything is in this infinitely precise universe. But when we actually implement it, there's some level of approximation. So I'm interested in understanding first, theoretically, what is this approximation? How important is it that I'm treating my model as running on an infinitely precise machine when I actually have finite precision?

And second, what are the implications of that gap for Bayesian inference? Does it mean that some properties of my Markov chain no longer hold, because I'm actually running it on a finite precision machine whereas all my analysis assumed infinite precision? Or what does it mean about the actual variables we generate? So, you know, we might generate a Gaussian random variable, but in practice, the variable we're simulating has some other distribution.

Can we theoretically quantify that other distribution and its error with respect to the true distribution? Or can we come up with sampling procedures that are as close as possible to the ideal real-valued distribution? And so this brings together ideas from information theory, from theoretical computer science. And one of the motivations is to thread those results through into the actual Bayesian inference procedures that we implement using probabilistic programming languages.
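As a tiny illustration of the finite-precision gap discussed above, here is the classic example: a value as simple as 0.1 has no exact binary floating-point representation, so even adding it to itself ten times drifts away from the exact answer that real-number arithmetic would give.

```python
from fractions import Fraction

# In exact rational arithmetic, ten copies of 1/10 sum to exactly 1.
exact = sum([Fraction(1, 10)] * 10)

# In binary floating point, 0.1 has no exact representation, so the
# same sum accumulates rounding error and lands just below 1.
approx = sum([0.1] * 10)

print(exact == 1)         # True
print(approx == 1.0)      # False
print(abs(approx - 1.0))  # a tiny rounding error, on the order of 1e-16
```

Each individual error is tiny, but an MCMC chain executes millions of such operations, which is exactly why the question of what the "implemented" distribution actually is becomes non-trivial.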

So that's just, you know, an overview of these three or four different areas that I'm interested in and I've been working on recently. Yeah, that's amazing. Thanks a lot for these, like full panel of what you're doing. And yeah, that's just incredible also that you're doing so many things. I'm really impressed. And of course we're going to dive a bit into these, at least some of these topics. I don't want to take three hours of your time, but...

Before that though, I'm curious if you remember when and how you first got introduced to Bayesian inference, and also why it stuck with you, because it seems like it's underpinning most of your work, at least that idea of probabilistic programming. Yeah, that's a good question. I think I was first interested in probability before I was interested in Bayesian inference. I remember I used to read a book by Mosteller called Fifty Challenging Problems in Probability.

I took a course in high school and I thought, how could I actually use these cool ideas for fun? And there was actually a very nice book written back in the 50s by Mosteller. So that got me interested in probability and how we can use probability to reason about real-world phenomena.

So the book that... that I used to read would sort of have these questions about, you know, if someone misses a train and the train has a certain schedule, what's the probability that they'll arrive at the right time? And it's a really nice book because it ties in our everyday experiences with probabilistic modeling and inference.

And so I thought, wow, this is actually a really powerful paradigm for reasoning about the everyday things that we do, like missing a bus and knowing something about its schedule, and when's the right time I should arrive to maximize the probability of some event of interest, things like that. So that really got me hooked on the idea of probability.

But I think what really connected Bayesian inference to me was taking, I think as a senior or as a first-year master's student, a course by Professor Josh Tenenbaum at MIT, which is Computational Cognitive Science. And that course has evolved

quite a lot through the years, but the version that I took was really a beautiful synthesis of lots of deep ideas about how Bayesian inference can tell us something meaningful about how humans reason about different empirical phenomena and cognition.

So, you know, in cognitive science, for the majority of the history of the field, people would run these experiments on humans and try to analyze them using some type of frequentist statistics; they would not really use generative models to describe how humans are solving a particular experiment. But Professor Tenenbaum's approach was to use Bayesian models

as a way of describing, or at least emulating, the cognitive processes that humans use for solving these types of cognition tasks. And by cognition tasks, I mean simple experiments you might ask a human to do: you might have some dots on a screen and you might tell them, all right, you've seen five dots, why don't you extrapolate the next five? Just simple cognitive experiments.

I think that being able to use Bayesian models to describe very simple cognitive phenomena was another really appealing prospect to me throughout that course, seeing all the ways in which that manifested in very nice questions about how we do efficient inference in real time. Because humans are able to do inference very quickly. And Bayesian inference is obviously very challenging to do.

But then, if we actually want to engineer systems, we need to think about the hard questions of efficient and scalable inference in real time, maybe at human-level speeds. Which brought in a lot of the reason why I'm so interested in inference as well, because that's one of the harder aspects of Bayesian computing. And then I think a third thing which really hooked me on Bayesian inference was taking a machine learning course and kind of comparing.

So the way these machine learning courses work is they'll teach you empirical risk minimization, then they'll teach you some type of optimization, and then there'll be a lecture called Bayesian Inference. And what was so interesting to me at the time was that, up until the lecture where we learned anything about Bayesian inference, all of these machine learning concepts seemed to just be a hodgepodge of random tools and techniques that people were using.

So, you know, there's the support vector machine and it's good at classification, and then there's the random forest and it's good at this. But what's really nice about using Bayesian inference in the machine learning setting, or at least what I found appealing, was how you have a very clean specification of the problem that you're trying to solve in terms of, number one, a prior distribution

over parameters and observable data; number two, the actual observed data; and number three, the posterior distribution that you're trying to infer. So you have a very nice high-level specification of what is even the problem you're trying to solve, before you even worry about how you solve it.

You can very cleanly separate modeling and inference, whereby most of the machine learning techniques that I was initially learning about seemed to be focused only on how do I infer something, without crisply formalizing the problem that I'm trying to solve.

So once we have this Bayesian posterior that we're trying to infer, then maybe we'll do fully Bayesian inference, or maybe we'll do approximate Bayesian inference, or maybe we'll just do maximum likelihood. That's maybe less of a detail. The more important detail is we have a very clean specification for our problem and we can, you know, build in our assumptions. And as we change our assumptions, we change the specification.

So it seemed like a very systematic way to build machine learning and artificial intelligence pipelines, using a principled process that I found easy to reason about. And I didn't really find that in the other types of machine learning approaches that we learned in the class. So I joined the Probabilistic Computing Project at MIT, which is run by my PhD advisor, Dr. Vikash Mansinghka.

And there, I really got the opportunity to explore these interests at the research level, not only in classes. And that's, I think, where everything took off afterwards. That's the synthesis of the various things that got me interested in the field. Yeah, thanks a lot for that, that's super interesting to see. And I definitely relate to the idea of the Bayesian framework being attractive,

not because it's a toolbox, but because it's more of a principle-based framework, basically, where instead of thinking, oh yeah, what tool do I need for that stuff, it's just always the same in a way. To me, it's cool because you don't have to be smart all the time, in a way, right? Every problem takes the same workflow. It's not going to be the same solution, but it's always the same workflow. Okay, what does the data look like? How can we model that?

What is the data-generating story? And then you have very different challenges all the time and different kinds of models, but you're not thinking about, okay, what is the ready-made model that I can apply to these data? It's more like, how can I create a custom model for these data, knowing the constraints I have about my problem, and thinking in a principled way instead of thinking in a toolkit way? I definitely relate to that. I find that amazing.

I'll just add to that: this is not only some type of aesthetic or theoretical idea. I think it's actually strongly tied to good practice that makes it easier to solve problems. And by that, what do I mean? Well, I did a very brief undergraduate research project in a computational biology lab.

And just looking at the empirical workflow that was done made me very suspicious about the process. You know, you might have some data, and then you'll hit it with PCA and get some projection of the data, and then you'll use a random forest classifier to classify it in different ways, and then you'll use the classification in some type of logistic regression. So you're just chaining these ad hoc data analyses to come up with some final story.

And while that might be okay to get you some specific result, it doesn't really tell you anything about how changing one modeling choice in this pipeline is going to impact your final inference. Because this sort of mix-and-match approach of applying different ad hoc estimators to solve different subtasks doesn't really give us a way to iterate on our models, understand their limitations, know their sensitivity to different choices, or even build computational systems that automate a lot of these things, right? Like probabilistic programs. Like you're saying, we can write our data-generating process as the workflow itself, right?

Rather than, you know, maybe in Matlab I'll run PCA and then I'll use scikit-learn in Python. Without this type of prior distribution over our data, it becomes very hard to reason formally about our entire inference workflow, which, you know, probabilistic programming languages are trying to make easier, giving a more principled approach that's more amenable to engineering, to optimization, to things of that sort. Yeah. Fantastic point. Definitely.

And that's also the way I personally tend to teach Bayesian stats now. It's much more principle-based and workflow-based, instead of just, okay, Poisson regression is this, multinomial regression is that. I find that much more powerful, because then when students get out in the wild, they are used to first thinking about the problem and then trying to see how they could solve it, instead of just trying to find which model is going to be the most useful among the models they already know. Because then, if the data are different, you're going to have a lot of problems. Yeah.

And so, you actually talked about the different topics that you work on. There are a lot I want to ask you about. One of my favorites, and actually I think Colin also has been working a bit on that lately, is the development of AutoGP.jl. So I think that'd be cool to talk about that. What inspired you to develop that package, which is in Julia?

Maybe you can also talk about whether you mainly develop in Julia most of the time, or if that was mostly useful for that project. And how does this package help learn the structure of Gaussian process kernels? Because if I understand correctly, that's what the package is mostly about. So yeah, if you can give a primer to listeners about that.

Definitely. Yes. So Gaussian processes are a pretty standard model that's used in many different application areas: spatiotemporal statistics and many engineering applications based on optimization. These Gaussian process models are parameterized by covariance functions, which specify how the data produced by the Gaussian process co-varies across time, across space, across any domain on which you're able to define some type of covariance function.

But one of the main challenges in using a Gaussian process for modeling your data is making the structural choice about what the covariance structure should be. One of the most common choices is to use some type of radial basis function for your data, the RBF kernel, or maybe a linear kernel or a polynomial kernel, somehow hoping that you'll make the right choice to model your data accurately.
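To make the kernel choice concrete, here is a minimal sketch (plain Python, not AutoGP.jl; hyperparameter values are illustrative) of a few standard covariance functions and the covariance matrix they induce over a set of input points:

```python
import math

# Three common covariance functions (hyperparameters here are illustrative).
def rbf(x, y, lengthscale=1.0):
    return math.exp(-0.5 * (x - y) ** 2 / lengthscale ** 2)

def linear(x, y, c=0.0):
    return (x - c) * (y - c)

def periodic(x, y, period=1.0, lengthscale=1.0):
    return math.exp(-2.0 * math.sin(math.pi * abs(x - y) / period) ** 2
                    / lengthscale ** 2)

# A kernel applied pairwise to input points gives the covariance matrix,
# which encodes how strongly observations at those points co-vary.
ts = [0.0, 0.5, 1.0, 1.5]
K = [[rbf(a, b) for b in ts] for a in ts]
```

Each choice bakes in a different structural assumption (smooth local correlation, a global trend, repetition), which is exactly why picking the kernel by hand is the bottleneck the episode is about.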

So the inspiration for auto GP or automatic Gaussian process is to try and use the data not only to infer the numeric parameters of the Gaussian process, but also the structural parameters or the actual symbolic structure of this covariance function.

And here we are drawing our inspiration from work, which is maybe almost 10 years old now, from David Duvenaud and colleagues, called the Automatic Statistician project, or ABCD, Automatic Bayesian Covariance Discovery, which introduced this idea of defining a symbolic language over Gaussian process covariance functions, or covariance kernels, using a recursive grammar, and trying to infer an expression in that grammar given the observed data.
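The recursive grammar idea can be sketched in a few lines. This is a toy in the spirit of ABCD, not the actual grammar used by AutoGP.jl or the Automatic Statistician: a kernel expression is either a base kernel or a sum/product of two sub-expressions, and a prior over structures is just a random sampler over that grammar.

```python
import random

# A toy recursive grammar over kernel expressions:
# expr := BASE | ("+", expr, expr) | ("*", expr, expr)
BASE_KERNELS = ["RBF", "Linear", "Periodic"]

def sample_kernel(depth=0, p_base=0.6, max_depth=3):
    # Stop at a base kernel with probability p_base (or at max depth);
    # otherwise combine two recursively sampled kernels with + or *.
    if depth >= max_depth or random.random() < p_base:
        return random.choice(BASE_KERNELS)
    op = random.choice(["+", "*"])
    return (op, sample_kernel(depth + 1), sample_kernel(depth + 1))

def is_valid(expr):
    """Check that an expression belongs to the grammar."""
    if isinstance(expr, str):
        return expr in BASE_KERNELS
    op, left, right = expr
    return op in ("+", "*") and is_valid(left) and is_valid(right)

# Draws from this prior look like ("+", "Linear", ("*", "Periodic", "RBF")),
# i.e. a linear trend plus a locally periodic component.
draws = [sample_kernel() for _ in range(10)]
```

Structure discovery then amounts to doing Bayesian inference over expressions like these, conditioned on the observed time series, rather than just sampling them from the prior.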

So, you know, in a time series setting, for example, you might have time on the horizontal axis and the variable on the y-axis, and you just have some variable that's evolving. You don't necessarily know the dynamics of it, right? There might be some periodic structure in the data, or there might be multiple periodic effects. Or there might be a linear trend overlaying the data.

Or there might be a point in time at which the data switches between some process before the change point and some process after the change point. For example, in the COVID era, almost all macroeconomic data sets had some type of change point around April 2020, and we see that in the empirical data that we're analyzing today. So the question is, how can we automatically surface these structural choices using Bayesian inference?

So the original approach in the Automatic Statistician was based on a type of greedy search. They were trying to say, let's find the single kernel that maximizes the probability of the data. So they're doing a greedy search over these kernel structures for Gaussian processes using different search operators, and for each kernel, you might find the maximum likelihood parameters, et cetera. And I think that's a fine approach.

But it does run into some serious limitations, and I'll mention a few of them. One limitation is that greedy search is, in a sense, not representing any uncertainty about what's the right structure. It's just finding a single best structure that maximizes some probability, or maybe the likelihood of the data. But just like parameters are uncertain, structure can also be quite uncertain, because the data is very noisy, or we may have sparse data.

And so we'd want inference systems that are more robust when discovering the temporal structure in the data, and greedy search doesn't really give us that level of robustness through expressing posterior uncertainty. I think another challenge with greedy search is its scalability. If you have a very large data set, in a greedy search algorithm we're typically, at each stage of the search, looking at the entire data set to score our model.

And this is also true of traditional Markov chain Monte Carlo algorithms: we often score the whole data set. But in the Gaussian process setting, scoring the data set is very expensive. If you have N data points, it's going to cost you N cubed. And so it becomes quite infeasible to run greedy search, or even pure Markov chain Monte Carlo, where at each step, each time you change the parameters or you change the kernel, you need to compute the full likelihood.

And so the second motivation in AutoGP is to build an inference algorithm that is not looking at the whole data set at each point in time, but using subsets of the data set that are sequentially growing. And that's where the sequential Monte Carlo inference algorithm comes in. So AutoGP is implemented in Julia, and the API is that basically you give it a one-dimensional time series and you hit infer.

And then it's going to report an ensemble of Gaussian processes, a sample from the posterior distribution, where each Gaussian process has some particular structure and some numeric parameters. And you can show the user: hey, I've inferred these hundred GPs from my posterior. And then they can start using them for generating predictions. You can use them to find outliers, because these are probabilistic models. You can use them for a lot of interesting tasks.

Or you might say: this particular model actually isn't consistent with what I know about the data, so you might remove one of the posterior samples from your ensemble. Yeah, so we benchmarked AutoGP on the M3 competition data. The monthly data sets in M3 are around 1,500 time series, between 100 and 500 observations in length.

And we compared the performance against different statistical baselines and machine learning baselines. And it's actually able to find pretty common-sense structures in these economic data. Some of them have seasonal features, multiple seasonal effects as well. And what's interesting is we don't need to customize the prior to analyze each data set; it's essentially able to discover them.

And what's also interesting is that sometimes, when the data set just looks like a random walk, it's going to learn a covariance structure which emulates a random walk. So by having a very broad prior distribution on the types of covariance structures, it's able to find which of these are plausible explanations given the data. Yes, as you mentioned, we implemented this in Julia.

The reason is that AutoGP is built on the Gen probabilistic programming language, which is embedded in Julia. Gen was developed primarily by Marco Cusumano-Towner in his PhD thesis; he was a colleague of mine at the MIT Probabilistic Computing Project. And Gen is a Turing-complete language with programmable inference.

So you're able to write a prior distribution over these symbolic expressions in a very natural way, and you're able to customize an inference algorithm that solves this problem efficiently. What really drew us to Gen for this problem is twofold. The first is its support for sequential Monte Carlo inference: it has a pretty mature library for doing sequential Monte Carlo.

And sequential Monte Carlo construed more generally than just particle filtering — other types of inference over sequences of probability distributions. So particle filters are one type of sequential Monte Carlo algorithm you might write, but you might also do some type of temperature annealing or data annealing or other sequentialization strategies. And Gen provides a very nice toolbox and abstraction for experimenting with different types of sequential Monte Carlo approaches.

And so we definitely made good use of that library when developing our inference algorithm. The second reason I think Gen was very nice to use is its library for involutive MCMC. Involutive MCMC is a relatively new framework; it was discovered, I think, concurrently and independently both by Marco and other folks. And you can think of it as a generalization of reversible jump MCMC.

And it's really a unifying framework to understand many different MCMC algorithms using a common terminology. There's a wonderful ICML paper which lists 30 or so different algorithms that people use all the time — Hamiltonian Monte Carlo, reversible jump MCMC, Gibbs sampling, Metropolis-Hastings — and expresses them using the language of involutive MCMC. I believe the author is Kirill Neklyudov, although I might be mispronouncing that, sorry for that.

So Gen has a library for involutive MCMC, which makes it quite easy to write different proposals for how you do this inference over your symbolic expressions. Because when you're doing MCMC within the inner loop of a sequential Monte Carlo algorithm, you need to somehow be able to improve your current symbolic expressions for the covariance kernel, given the observed data. And doing that is hard, because this is kind of a reversible jump algorithm where you make a structural change.

Then you maybe need to generate some new parameters, and you need the reverse probability of going back. And so Gen has a lot of automation and a library for implementing these types of structure moves in a very high-level way. It automates the low-level math for computing the acceptance probability and embedding all of that within an outer-level SMC loop.

And so this is, I think, one of my favorite examples of what probabilistic programming can give us: very expressive priors over these symbolic expressions generated by symbolic grammars, and powerful inference algorithms using combinations of sequential Monte Carlo, involutive MCMC and reversible jump moves, and gradient-based inference over the parameters. It really brings together a lot of the strengths of probabilistic programming languages.

And we showed, at least on these M3 datasets, that they can actually be quite competitive with state-of-the-art solutions, both in statistics and in machine learning. I will say, though, that as with traditional GPs, the scalability bottleneck is really in the likelihood.

So as for whether AutoGP can handle datasets with 10,000 data points — that's actually quite hard, because ultimately, once you've seen all the data in your sequential Monte Carlo, you will be forced to do this sort of N-cubed scaling. So you need some type of improvement, some type of approximation, for handling larger data.

But I think what's more interesting about AutoGP is not necessarily that it's applied to inferring structures of Gaussian processes, but that it's sort of a library for inferring probabilistic structure, showing how to do that by integrating these different inference methodologies. Hmm, okay. Yeah, so many things here. So first, I put all the links to AutoGP.jl in the show notes.

I also put a link to the underlying paper that you've written with some co-authors about the sequential Monte Carlo learning you're doing to discover these time-series structures, for people who want to dig deeper. And I also put links to most of the LBS episodes where we talk about Gaussian processes, for people who need a bit more background information, because here we're mainly going to talk about how you do that and how useful it is.

We're not going to give a primer on what Gaussian processes are — if you want that, folks, there are a bunch of episodes in the show notes. So, on the practical utility of that time-series discovery: if I understood correctly, for now you can do that only on one-dimensional input data. So that would basically be a time series. You cannot input, let's say, categories — these could be age groups.

So usually, I think the way to give that to a GP would be to one-hot encode each of these age groups. And then, let's say you have four age groups: now the input dimension of your GP is not one, which is time, but five — one for time and four for the age groups. This would not work here, right? Right, yes.

So at the moment we're focused on what are called, I guess in econometrics, pure time series models, where you're only trying to do inference on the time series based on its own history. I think the extensions you're proposing are very natural to consider. You might have a multi-input Gaussian process where you're not only looking at your own history, but also considering some type of categorical variable.

Or you might have exogenous covariates evolving along with the time series. If you want to predict temperature, for example, you might have the wind speed, and you might want to use that as a feature for your Gaussian process. Or you might have a multi-output Gaussian process, where you want a Gaussian process over multiple different time series jointly. And I think all of these variants are possible to develop.

There's no fundamental difficulty, but I think the main challenge is how to define a domain-specific language over these covariance structures for multivariate input data — that becomes a little bit more challenging. In the time series setting, what's nice is we can interpret how any type of covariance kernel is going to impact the actual prior over time series.

Once we're in the multi-dimensional setting, we need to think about how to combine the kernels for different dimensions in a way that's actually meaningful for modeling, and to ensure it's still tractable. But I think extensions of the DSL to handle multiple inputs, exogenous covariates, and multiple outputs are all great directions. And I'll just add on top of that: I think another important direction is using some of the more recent approximations for Gaussian processes,

so we're not bottlenecked by the N-cubed scaling. There are, I think, a few different approaches that have been developed. There are approaches based on stochastic PDEs or state-space approximations of Gaussian processes, which are quite promising. There are other things like nearest-neighbor Gaussian processes, but I'm a little less confident about those, because we lose a lot of the nice affordances of GPs once we start doing nearest-neighbor approximations.

But there are a lot of new methods for approximate GPs. We might do stochastic variational inference, for example — an SVGP. So as we think about handling richer types of data, we should also think about how to start introducing some of these more scalable approximations, to make sure we can still do the structure learning efficiently in that setting. Yeah, that would be awesome, for sure. I'm much more on the practitioner side than on the math side.

Of course, that's where my head goes first. I'm like, oh, that'd be awesome, but I would need that for it to be really practical. And so if I use AutoGP.jl and give it time series data, what do I get back? Do I get back the posterior samples of the implied model, or do I get back the covariance structure?

I don't know what form that could take, but I'm thinking — often when I use GPs, I use them inside other models. I could use a GP in a linear regression, for instance.

And so I'm thinking that'd be cool if I'm not sure about the covariance structure, especially if it can do the discovery of the seasonality and things like that automatically — because seasonality is always a bit weird, and you have to add another GP that can handle periodicity. And then you basically have a sum of GPs, and you can take that sum of GPs and put it in the linear predictor of the linear regression. That's usually how I use that.

And very often I'm using categorical predictors, almost always. And I'm thinking what would be super cool is if I could outsource that discovery part of the GP to the computer, like you're doing with this algorithm, and then get back — under what form, I don't know yet, I'm just thinking about that — this covariance structure, which would be an MV normal, a multivariate normal in a way, that I just use in my linear predictor.

And then I can use that, for instance, in a PyMC model or something like that, without having to specify the GP myself. Is that something that's doable? Yeah, yeah, I think that's absolutely right. Because Gaussian processes are compositional — you mentioned the sum of two Gaussian processes, which corresponds to the sum of two kernels. If I have Gaussian process one plus Gaussian process two, that's the same as the Gaussian process whose covariance is k1 plus k2.
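That additivity is easy to check numerically. A small sketch of my own (a linear and a periodic base kernel standing in for AutoGP's synthesized ones): a draw from GP(0, k1) plus an independent draw from GP(0, k2) is distributed as a draw from GP(0, k1 + k2).

```python
import numpy as np

def linear_kernel(x1, x2, variance=1.0):
    """Linear base kernel: k(x, x') = variance * x * x'."""
    return variance * np.outer(x1, x2)

def periodic_kernel(x1, x2, period=1.0, lengthscale=1.0):
    """Periodic base kernel, for seasonal structure."""
    d = np.abs(x1[:, None] - x2[None, :])
    return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / lengthscale**2)

x = np.linspace(0, 2, 50)
K1 = linear_kernel(x, x)
K2 = periodic_kernel(x, x)

# f1 ~ GP(0, k1), f2 ~ GP(0, k2), independent  =>  f1 + f2 ~ GP(0, k1 + k2)
rng = np.random.default_rng(0)
jitter = 1e-8 * np.eye(len(x))  # numerical stability for the (near-singular) covariances
f1 = rng.multivariate_normal(np.zeros(len(x)), K1 + jitter)
f2 = rng.multivariate_normal(np.zeros(len(x)), K2 + jitter)
f_sum = f1 + f2  # one draw from the combined "trend + seasonality" GP with covariance K1 + K2
```

So a synthesized kernel like `Linear + Periodic` can be handed to any GP library as a single covariance function.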

And so what that means is we can take our synthesized kernel — which is comprised of some base kernels and then maybe sums and products and change points — and wrap all of these in just one mega GP, basically, which would encode the entire posterior, or a summary of all the samples, in one GP. And I think you also mentioned an important point, which is multivariate normals: you can also think of the posterior as just a mixture of these multivariate normals.

So let's say I'm not going to compress them into a single GP, but I'm actually going to represent the output of AutoGP as a mixture of multivariate normals. That would be another type of API. So depending on exactly how you're planning to use the GP, you can use the output of AutoGP in the right way — because ultimately it's producing some covariance kernels, and you might aggregate them all into one GP, or you might compose them together to make a mixture of GPs.

And you can export this to PyTorch — most of the current libraries for GPs support composing GPs with one another, et cetera. So depending on the use case, it should be quite straightforward to figure out how to leverage the output of AutoGP within the internals of some larger linear regression model or other type of model.

Yeah, that's definitely super cool, because then you can outsource that part of the model, where I think the algorithm — if not now, then in just a few years — is probably going to do a better job than most modelers, at least for a rough first draft. That's right. The first draft. A data scientist who's determined enough to beat AutoGP can probably do it, if they put in enough effort to study the data.

But it's getting a first-pass model that's actually quite good, as compared to other types of automated techniques. Yeah, exactly. It's like asking ChatGPT for a first draft of a blog post and then going in there yourself and improving it, instead of starting everything from scratch. Sure, you could do it all yourself, but that's not where your value added really lies. So what you get is these kinds of samples. In a way, do you get back samples?

or do you get symbolic variables back? You get symbolic expressions for the covariance kernels, as well as the parameters embedded within them. So let's say you asked for five posterior samples: you might have one posterior sample which is a linear kernel, another which is linear times linear — so a quadratic kernel — and maybe a third which is again a linear kernel, and each of them will have different parameters.

And because we're using sequential Monte Carlo, all of the posterior samples are associated with weights. Sequential Monte Carlo returns a weighted particle collection which approximates the posterior. So you get back these weighted particles, which are symbolic expressions. And in AutoGP we have a minimal GP prediction library.

So you can actually put these symbolic expressions into a GP to get a functional GP, or you can export them to a text file and then use your favorite GP library and embed them within that as well. And we also get noise parameters: each kernel is going to be associated with output noise, because obviously, depending on what kernel you use, you're going to infer a different noise level. So you get a kernel structure, parameters, and noise for each individual particle in your SMC ensemble.
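As a rough mental model of that output — a sketch of my own, not AutoGP's actual types — each particle bundles a symbolic kernel expression with its parameters, its noise level, and its SMC weight. The kernel strings and parameter names below are made up for illustration:

```python
import math
from dataclasses import dataclass

@dataclass
class Particle:
    kernel: str        # symbolic expression, e.g. "Linear * Linear" (hypothetical syntax)
    params: dict       # numeric parameters embedded in the expression
    noise: float       # inferred observation-noise level for this structure
    log_weight: float  # unnormalized SMC importance weight, in log space

def normalized_weights(particles):
    """Convert log weights to normalized probabilities (log-sum-exp for stability)."""
    m = max(p.log_weight for p in particles)
    w = [math.exp(p.log_weight - m) for p in particles]
    total = sum(w)
    return [wi / total for wi in w]

ensemble = [
    Particle("Linear", {"slope_var": 1.2}, noise=0.30, log_weight=-10.1),
    Particle("Linear * Linear", {"v1": 0.8, "v2": 1.1}, noise=0.20, log_weight=-9.4),
    Particle("Periodic + Linear", {"period": 12.0}, noise=0.25, log_weight=-9.9),
]
weights = normalized_weights(ensemble)  # the posterior approximation: a weighted mixture
```

Predictions then come from the weighted mixture over particles, which is exactly the "mixture of multivariate normals" view discussed earlier.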

OK, I see. Yeah, super cool. And so, if you can get that back as a text file — either you use it in a full Julia program, or if you prefer R or Python, you could use AutoGP.jl just for that: get back a text file and then use it in R or Python in another model, for instance. Okay, that's super cool. Do you have examples of that we can link to for listeners in the show notes? We have a tutorial.

The tutorial prints the learned structures into the output cells of the IPython notebooks. So you could take the printed structure, save it as a text file, and write your own little parser for extracting those structures and building an R GP or a PyTorch GP or any other GP. Okay, yeah, that's awesome. And do you know if there is already an implementation in R or in Python of what you're doing in AutoGP.jl?

Yeah, so this project was implemented during my year at Google — between finishing my PhD and starting at CMU, I was at Google for a year as a visiting faculty scientist. And some of the prototype implementations were also in Python. But I think the only public version at the moment is the Julia version.

But I think it's a little bit challenging to reimplement this, because one of the things we learned when trying to implement it in Python is that we don't have Gen — or at least at the time we didn't. The reason we focused on Julia is that we could use the power of the Gen probabilistic programming language, in a way that made model development and iteration much more feasible than a pure Python implementation, or an R implementation, or another language.

Yeah, okay. I would have so many more questions on that, but I think that's already a good overview of the project. Maybe I'm curious about the biggest obstacle you had on the path when developing that package, AutoGP.jl — and also, what are your future plans for it? What would you like to see it become in the coming months and years? Yeah, so thanks for those questions.

For the biggest challenge: I think designing and implementing the inference algorithm that includes sequential Monte Carlo and involutive MCMC. That was a challenge, because there aren't many prior works in the literature that have actually explored this type of combination, which is really at the heart of AutoGP — designing the right proposal distributions. I have some given structure and I have my data: how do I do a data-driven proposal?

So I'm not just blindly proposing some new structure from the prior, or some new sub-structure, but actually using the observed data to come up with a smart proposal for how to improve the structure in the inner loop of MCMC. We put a lot of thought into the actual move types and how to use the data to come up with data-driven proposal distributions. The paper describes some of these tricks. There are moves based on replacing a random subtree.

There are moves which detach the subtree and throw everything away, or embed the subtree within a new tree. So there are these different types of moves, which we found are more helpful to guide the search. And it was a challenging process to figure out how to implement those moves and how to debug them. So that, I think, was part of the challenge.

I think another challenge we were facing was, of course, the fact that we were using these dense Gaussian process models, without the approximations that are needed to scale to, say, tens or hundreds of thousands of data points. And so this, I think, was part of the motivation for thinking about what other types of GP approximations would let us handle datasets of that size.

In terms of what I'd like AutoGP to be in the future, I think there are two answers. One answer — and I think there's already a nice success case here — is that I'd like the implementation of AutoGP to be a reference for how to do probabilistic structure discovery using Gen. I expect that people across many different disciplines have this problem of not knowing what their specific model is for the data.

And then you might have a prior distribution over symbolic model structures, and given your observed data, you want to infer the right model structure. I think in the AutoGP code base we have a lot of the important components needed to apply this workflow to new settings. We've really put a lot of effort into having the code be self-documenting, in a sense, to make it easier for people to adapt it for their own purposes.

And so there was a recent paper this year, presented at NeurIPS by Tracey Mills and Sam Cheyette from Professor Tenenbaum's group, that extended the AutoGP package for a task in cognition — which was very nice to see: the code isn't only valuable for its own purpose, but also adaptable by others for other types of tasks.

And the second thing I'd like AutoGP, or at least AutoGP-type models, to do — and this goes back to the original automatic statistician that motivated AutoGP, work from, say, 10 years ago — the automated statistician had a natural language processing component. At the time there was no ChatGPT or large language models,

so they just wrote some simple rules to take the learned Gaussian process and summarize it in terms of a report. But now we have much more powerful language models. And one question could be: how can I use the outputs of AutoGP and integrate them within a language model, not only for reporting the structure, but also for answering probabilistic queries?

So you might say: find me a time when there could be a change point; or give me a numerical estimate of the covariance between two different time slices; or impute the data between these two time regions; or give me a 95% prediction interval. A data scientist — or rather a domain specialist — can write these in natural language, and then you would compile them into different little programs that query the GP learned by AutoGP.

And so creating some type of higher-level interface that makes it possible for people to not necessarily dive into the guts of Julia, or even implement an IPython notebook, but have the system learn the probabilistic models and then have a natural language interface which you can use to query those models — either for learning something about the structure of the data, or for solving prediction tasks.

And in both cases, I think off-the-shelf models may not work so well, because they may not know how to parse the AutoGP kernel to come up with a meaningful summary of what it actually means in terms of the data, or they may not know how to translate natural language into Julia code for AutoGP. So there's a little bit of research into thinking about how to fine-tune these models so that they're able to interact with the automatically learned probabilistic models.

And I'll just mention here one of the benefits of an AutoGP-like system: its interpretability. Because Gaussian processes are quite transparent — like you said, at the end of the day they're these giant multivariate normals — we can explain to people who use these types of distributions, and who are comfortable with them, exactly what distribution has been learned.

It's not: here are some weights in some giant neural network, here's the prediction, and you have to live with it. Rather, you can say: well, here's our prediction, and the reason we made this prediction is that we inferred a seasonal component with such-and-such frequency. So you can get the predictions, but you can also get some type of interpretable summary of why those predictions were made, which maybe helps with the trustworthiness of the system, or just transparency more generally.

Yeah, I'm signing up now — that sounds like an awesome tool. That looks absolutely fantastic, and hopefully these kinds of tools will help. I'm definitely curious to try that on my own models now: see what AutoGP.jl tells me about the covariance structure, and then try to use that myself in a model of mine — probably in Python, so that I have to get it out of Julia and see how you can plug it into another model.

That would be super, super interesting, for sure. I'm going to try and find an excuse to do that. Actually, I'm curious now — we could talk a bit about how that's done, right? How you do that discovery of the time series structure. You've mentioned that you're using sequential Monte Carlo to do that. So, SMC: can you give listeners an idea of what SMC is and why it would be useful in that case? And also whether

the way you do it for this project differs from the classical way of doing SMC. Good, yes, thanks for that question. So sequential Monte Carlo is a very broad family of algorithms. And I think one of the confusing parts for me when I was learning sequential Monte Carlo is that a lot of the introductory material is very closely married to particle filters. But particle filtering, which is only one application of sequential Monte Carlo, isn't the whole story.

And so I think there are now more modern expositions of sequential Monte Carlo which really bring to light how general these methods are. And here I'd like to recommend Professor Nicolas Chopin's textbook, An Introduction to Sequential Monte Carlo — it's a Springer 2020 textbook. I continue to use it in my research, and I think it's a very well-written overview of just how general and how powerful sequential Monte Carlo is.

So, a brief explanation of sequential Monte Carlo. Maybe one way to introduce it is by contrast with traditional Markov chain Monte Carlo. In traditional MCMC, we have some particular latent state, let's call it theta, and theta is supposed to be drawn from P of theta given X, where that's our posterior distribution and X is the data. We just apply some transition kernel over and over again, and we hope that,

in the limit of the applications of these transition kernels, we're going to converge to the posterior distribution. Okay, so MCMC is just one iterative chain that you run forever. You can do a little bit of modification — you might have multiple chains which are independent of one another — but sequential Monte Carlo is, in a sense, trying to go beyond that: anything you can do in a traditional MCMC algorithm, you can do using sequential Monte Carlo.

But in sequential Monte Carlo, you don't have a single chain; you have multiple different particles. And each of these particles you can think of as being analogous, in some way, to a particular MCMC chain — but they're allowed to interact. So you start with some number of particles and no data, and what you do is draw these particles from your prior distribution.

Each of these draws from the prior is basically a draw from p of theta. And now I'd like to get them to p of theta given x — that's my goal. So I start with a bunch of particles drawn from p of theta, and I'd like to get them to p of theta given x. How am I going to do that? There are many different ways, and that's exactly what's sequential here: how do you go from the prior to the posterior?

The approach we take in AutoGP is based on this idea of data tempering. So let's say my data x consists of a thousand measurements, and I'd like to go from p of theta to p of theta given x. Here's one sequential strategy I can use to bridge between these two distributions: I can start with p of theta, then p of theta given x1, then p of theta given x1 and x2, then p of theta given x1, x2, and x3. So I can anneal, or temper, these data points into the prior.

And the more data points I put in, the closer I'm going to get to the full posterior, p of theta given x1 through x1000. Or you might introduce the data in batches. But the key idea is that you start with draws from the prior, typically, and then you're just adding more and more data and reweighting the particles based on the probability they assign to the new data.

So if I have 10 particles and some particle is always assigning a very high score to the new data, I know that's a particle that's explaining the data quite well. And so I might resample these particles according to their weights, to get rid of the particles that are not explaining the new data well and to focus my computational effort on the particles that are. And this is something that an MCMC algorithm does not give us.

Because even if we run, say, a hundred MCMC chains in parallel, we don't know how to resample the chains — they're all independent executions, and we don't have a principled way of assigning a score to those different chains. You can't use the joint likelihood; that's not a valid or even a meaningful statistic for measuring the quality of a given chain.

But SMC, because it's built on importance sampling, has a principled way for us to assign weights to these different particles and focus on the ones which are most promising. And then I think the final component missing from my explanation is: where does the MCMC come in? Traditionally, in sequential Monte Carlo, there was no MCMC.
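The reweight-and-resample loop just described can be sketched with a toy scalar model (my own illustration: the "structure" here is just a scalar theta with a Gaussian likelihood, nothing like AutoGP's symbolic kernels):

```python
import numpy as np

rng = np.random.default_rng(1)

def log_lik(theta, batch):
    """Placeholder per-batch log likelihood: data modeled as N(theta, 1)."""
    return -0.5 * np.sum((batch - theta) ** 2)

# Particles start as draws from the prior p(theta): the start of the tempering sequence.
n_particles = 200
particles = rng.normal(0.0, 5.0, n_particles)
log_w = np.zeros(n_particles)

data = rng.normal(2.0, 1.0, 100)              # true theta = 2.0
for batch in np.array_split(data, 10):        # temper the data in, one batch at a time
    # Reweight: each particle is scored by the probability it assigns to the NEW batch only.
    log_w += np.array([log_lik(th, batch) for th in particles])
    # Resample when the effective sample size degenerates, to focus effort
    # on the particles that are explaining the data well.
    w = np.exp(log_w - log_w.max()); w /= w.sum()
    ess = 1.0 / np.sum(w ** 2)
    if ess < n_particles / 2:
        idx = rng.choice(n_particles, size=n_particles, p=w)
        particles, log_w = particles[idx], np.zeros(n_particles)

w = np.exp(log_w - log_w.max()); w /= w.sum()
posterior_mean = float(np.sum(w * particles))  # should land near the true value 2.0
```

In AutoGP the particles are symbolic kernel expressions with parameters, and the expensive GP likelihood is only ever evaluated on the data introduced so far.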

You would just have your particles, you would add new data, you would reweight based on the probability of the data, then you would resample the particles; then add the next batch of data, reweight, resample, et cetera. But you're also able, in between adding new data points, to run MCMC in the inner loop of sequential Monte Carlo. And that does not make the algorithm incorrect — it preserves the correctness of the algorithm, even if you run MCMC.

And there the intuition is that your prior draws are not going to be good. So after I've observed, say, 10% of the data, I might actually run some MCMC on that subset of the data before I introduce the next batch. So after reweighting the particles, you're also using a little bit of MCMC to improve their structure given the data observed so far. And that's where the MCMC is run, inside the inner loop.
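That rejuvenation step can be sketched in isolation: a few random-walk Metropolis moves applied to each particle, targeting the posterior given only the data seen so far. Since each move leaves that posterior invariant, running it between SMC reweighting steps preserves correctness. (Again a toy scalar model of my own, not AutoGP's involutive structure moves over kernel expressions.)

```python
import numpy as np

rng = np.random.default_rng(2)

def log_post(theta, seen):
    """Unnormalized log posterior given the data observed so far:
    N(0, 25) prior on theta, N(theta, 1) likelihood."""
    return -0.5 * theta ** 2 / 25.0 - 0.5 * np.sum((seen - theta) ** 2)

def rejuvenate(particles, seen, n_steps=100, step=1.0):
    """Random-walk Metropolis on each particle. Each step leaves
    p(theta | seen) invariant, so the SMC algorithm stays correct."""
    particles = particles.copy()
    for _ in range(n_steps):
        prop = particles + rng.normal(0.0, step, len(particles))
        log_accept = np.array([log_post(p, seen) - log_post(c, seen)
                               for p, c in zip(prop, particles)])
        accept = np.log(rng.uniform(size=len(particles))) < log_accept
        particles[accept] = prop[accept]
    return particles

seen = rng.normal(2.0, 1.0, 30)          # the first chunk of data observed so far
particles = rng.normal(0.0, 5.0, 100)    # stale prior draws
particles = rejuvenate(particles, seen)  # now roughly distributed per p(theta | seen)
```

In AutoGP the analogous moves are the involutive-MCMC structure moves: replace a random subtree of the kernel expression, detach it, or embed it in a new tree, with the acceptance math handled by Gen.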

So some of the benefits of this approach are, like I mentioned at the beginning: in MCMC you have to compute the probability of all the data at each step, but in SMC, because we're sequentially incorporating new batches of data, we can get away with only looking at, say, 10 or 20% of the data and get some initial inferences before we've actually reached the end and processed all of the observed data. So that's, I guess, a high-level overview of the algorithm that AutoGP is using.

It's annealing, or tempering, the data; it's reassigning the scores of the particles based on how well they're explaining the new batch of data; and it's running MCMC to improve their structure by applying these different moves — removing a sub-expression, adding a sub-expression, different things of that nature. Okay, yeah.

Thanks a lot for this explanation, because that was a very hard question on my part, and I think you've done a tremendous job explaining the basics of SMC and when it would be useful. So yeah, thank you very much — I think that's super helpful. And why, in this case, when you're trying to do these kinds of time series discoveries, would SMC be more useful than classic MCMC? Yeah, so it's more useful for several reasons.

One reason is that you might actually have a true streaming problem. If your data is actually streaming, you can't use MCMC, because MCMC operates on a static data set. So what if I'm running AutoGP in some type of industrial process system where data is coming in, and I'm updating the models in real time as the data arrives?

That's a purely online setting, which SMC is perfect for, but MCMC is not so well suited, because you basically don't have a way to, I mean, obviously you can always incorporate new data in MCMC, but that's not the traditional algorithm where we know its correctness properties. So when you have streaming data, that might be extremely useful.

But even if your data is not streaming, you know, there are theoretical results that show that convergence can be much improved when you use the sequential Monte Carlo approach. Because you have these multiple particles that are interacting with one another. And what they can do is they can explore multiple modes, whereas in MCMC, you know, each individual MCMC chain might get trapped in a mode.

And unless you have an extremely accurate posterior proposal distribution, you may never escape from that mode. But in SMC, we're able to resample these different particles so that they're interacting, which means that you can probably explore the space much more efficiently than you could with a single chain that's not interacting with other chains. And this is especially important in the types of posteriors that AutoGP is exploring, because these are symbolic expression spaces.

These are not Euclidean spaces. And so we expect there to be largely non-smooth components, and we want to be able to jump efficiently through this space through the resampling procedure of SMC, which is why it's a suitable algorithm.
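The particle interaction Feras mentions happens in the resampling step. Here is a generic sketch of that machinery in Python, a textbook systematic resampler plus the effective-sample-size diagnostic commonly used to decide when to resample; this is standard SMC plumbing, not AutoGP-specific code:

```python
import math
import random

def effective_sample_size(log_weights):
    # ESS measures particle degeneracy; practitioners typically trigger
    # a resampling step when ESS drops below, say, half the particle count.
    m = max(log_weights)
    w = [math.exp(lw - m) for lw in log_weights]
    s = sum(w)
    return s * s / sum(x * x for x in w)

def systematic_resample(particles, log_weights):
    # Low-variance resampling: one uniform draw, then evenly spaced points.
    # High-weight particles are cloned; low-weight particles are dropped,
    # which is how particles "interact" and share good explanations.
    m = max(log_weights)
    w = [math.exp(lw - m) for lw in log_weights]
    total = sum(w)
    n = len(particles)
    out, j, cum = [], 0, w[0] / total
    u = random.random() / n
    for i in range(n):
        pos = u + i / n
        while pos > cum and j < n - 1:
            j += 1
            cum += w[j] / total
        out.append(particles[j])
    return out
```

With equal weights every particle survives; when one particle carries nearly all the weight, the whole population collapses onto it and the rejuvenation MCMC then spreads the copies back out.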

And then the third component, and this is more specific to GPs in particular, is that because GPs have a cubic cost of evaluating the likelihood, in MCMC that's really going to bite you if you're doing it at each step. If I have, say, a thousand observations, I don't want to be doing that at each step. But in SMC, because the data is being introduced in batches, what that means is

I might be able to get some very accurate predictions using only the first 10% of the data, which is going to be quite cheap to evaluate the likelihood on. So you're somehow smoothly interpolating between the prior, where you can get perfect samples, and the posterior, which is hard to sample, using these intermediate distributions, which are closer to one another than the distance between the prior and the posterior.
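As a rough back-of-the-envelope for that cubic cost, assuming the likelihood evaluation is dominated by a Cholesky factorization of the n-by-n kernel matrix (roughly n cubed over 3 floating-point operations):

```python
def cholesky_flops(n):
    # Factorizing an n x n kernel matrix costs about n^3 / 3 flops,
    # which dominates each GP marginal likelihood evaluation.
    return n ** 3 / 3

full = cholesky_flops(1000)   # every MCMC step pays this on the full data
early = cholesky_flops(100)   # an early SMC step sees only 10% of the data
print(int(full / early))      # 1000: early SMC steps are ~1000x cheaper
```

So the early, exploratory phase of SMC, where many structures are still in play, runs on likelihoods that are orders of magnitude cheaper than the full-data likelihood MCMC would pay at every step.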

And that's what makes inference hard, essentially: the distance between the prior and the posterior. Because SMC is introducing the data in smaller batches, it's making this sort of bridging, it's making it easier to bridge between the prior and the posterior by having these partial posteriors, basically. Okay, I see. Yeah. Yeah, okay. That makes sense, because of that batching process, basically. Yeah, for sure. And the requirements also of MCMC coupled to a GP, that's...

That's for sure making stuff hard. Yeah. Yeah. And well, I've already taken a lot of time from you. So thanks a lot, Feras. I really appreciate it. And that's very, very fascinating, everything you're doing. I'm curious also because you're a bit on both sides, right? Where you see practitioners, but you're also on the very theoretical side. And also you teach. So I'm wondering, in your opinion, what's the biggest hurdle in the Bayesian workflow currently?

Yeah, I think there's really a lot of hurdles. I don't know if there's a biggest one. So obviously, you know, Professor Andrew Gelman has an enormous manuscript on the arXiv, which is called Bayesian Workflow. And he goes through the nitty-gritty of all the different challenges with coming up with a Bayesian model. But for me, at least, the one that's tied closely to my research is: where do we even start? Where do we start this workflow?

And that's really what drives a lot of my interest in automatic model discovery and probabilistic program synthesis. The idea is not that we want to discover the model that we're going to use for the rest of the lifetime of the workflow, but to come up with good explanations that we can use to bootstrap this process, after which we can apply the different stages of the workflow. But I think it's getting from just data to plausible explanations of that data.

And that's what, you know, probabilistic program synthesis or automatic model discovery is trying to solve. So I think that's a very large bottleneck. And then I'd say, you know, the second bottleneck is the scalability of inference. I think that Bayesian inference has a poor reputation in many corners because of how unscalable traditional MCMC algorithms are.

But I think in the last 10 to 15 years, we've seen many foundational developments in more scalable posterior inference algorithms that are being used in many different settings in computational science, et cetera. And I think building probabilistic programming technologies that better expose these different inference innovations is going to help push Bayesian inference to the next level of applications that people have traditionally thought were beyond reach because of the lack of scalability.

So I think putting a lot of effort into engineering probabilistic programming languages that really have fast, powerful inference, whether it's sequential Monte Carlo, whether it's Hamiltonian Monte Carlo with No-U-Turn sampling, or involutive MCMC over discrete structures. There's really a lot of different options, and these are all things that we've seen quite recently. And I think if you put them together, we can come up with very powerful inference machinery.

And then I think the last thing I'll say on that topic is, you know, we also need some new research into how to configure our inference algorithms.

So, you know, we spend a lot of time thinking, is our model the right model? But, you know, now that we have probabilistic programming, and we have inference algorithms maybe themselves implemented as probabilistic programs, we might think in a more mathematically principled way about how to optimize the inference algorithms in addition to optimizing the parameters of the model.

I think of some type of joint inference process where you're simultaneously using the right inference algorithm for your given model and have some type of automation that's helping you make those choices. Yeah, kind of like the automated statistician that you were talking about at the beginning of the show. Yeah, that would be fantastic. Definitely kind of like having a stats sidekick helping you when you're modeling. That would definitely be fantastic.

Also, as you were saying, the workflow is so big and diverse that it's very easy to forget about something, forget a step, neglect one, because we're all humans, you know, things like that. No, definitely. And as you were saying, you're also a professor at CMU. So I'm curious how you approach teaching these topics, teaching stats to prepare your students for all of these challenges, especially given the challenges of probabilistic computing that we've mentioned throughout this show.

Yeah, yeah, that's something I think about frequently, actually, because, you know, I haven't been teaching for a very long time, and over the course of the next few years I'm going to have to put a lot of effort into thinking about how to give students who are interested in these areas the right background so that they can quickly be productive.

And what's especially challenging, at least in my interest area, is that there's both the probabilistic modeling component and the programming languages component. And what I've learned is these two communities don't talk much with one another. You have people doing statistics who think, oh, a programming language is just our scripts, and that's really all it is, and I never want to think about it, because that's the messy details.

But programming languages, if we think about them in a principled way and we start looking at the code as a first-class citizen, just like our mathematical model is a first-class citizen, then we need to really be thinking in a much more principled way about our programs.

And I think the type of students who are going to make a lot of strides in this research area are those who really value programming languages theory, in addition to the statistics and the Bayesian modeling that's actually used for the workflow.

And so I think, you know, the type of courses that we're going to need to develop at the graduate or undergraduate level are going to need to really bring together these two different worldviews: the worldview of, you know, empirical data analysis, statistical model building, things of that sort, but also the programming languages view, where we're actually being very formal about what these actual systems are, what they're doing, what are their semantics, what are their

properties, what are the type systems that are enabling us to get certain guarantees, maybe compiler technologies. So I think there are elements of both of these two different communities that need to be put into teaching people how to be productive probabilistic programming researchers, bringing ideas from these two different areas.

So, you know, the students who I advise, for example, I often try and get a sense for whether they're more in the programming languages world and they need to learn a little bit more about the Bayesian modeling stuff, or whether they're more squarely in Bayesian modeling and they need to appreciate some of the PL aspects better.

And that's sort of a game that you have to play, to figure out what the right areas are to focus on for different students so that they can have a more holistic view of probabilistic programming and its goals, and probabilistic computing more generally, and build the technical foundations that are needed to carry forward that research. Yeah, that makes sense.

And related to that, are there any future developments that you foresee, or expect, or hope for in probabilistic reasoning systems in the coming years? Yeah, I think there's quite a few. And I think I already touched upon one of them, which is, you know, the integration with language models, for example. I think there's a lot of excitement about language models. From my perspective, as a research area, that's not what I do research in.

But I think, you know, if we think about how to leverage the things that they're good at, it might be for creating these types of interfaces between, you know, automatically learned probabilistic programs and natural language queries about these learned programs, for solving data analysis or data science tasks. And I think marrying these two ideas is important, because if people are going to start using language models for solving statistics problems, I would be very worried.

I don't think language models in their current form, which are not backed by probabilistic programs, are at all appropriate for doing data science or data analysis. But I expect people will be pushing in that direction. The direction that I'd really like to see thrive is the one where language models are interacting with probabilistic programs to come up with better, more principled, more interpretable reasoning for answering an end user's question.

So I think these types of probabilistic reasoning systems, you know, will really make probabilistic programs more accessible on the one hand, and will make language models more useful on the other hand. That's something that I'd like to see from the application standpoint. From the theory standpoint, I have many theoretical questions, which maybe I won't get into, which are really related to the foundations of random variate generation.

Like I was mentioning at the beginning of the talk, understanding in a more mathematically principled way the properties of the inference algorithms or the probabilistic computations that we run on our finite-precision machines. I'd like to build a type of complexity theory, or a theory about the error, complexity, and resource consumption of Bayesian inference in the presence of finite resources. And that's a much longer-term vision, but I think it will be quite valuable

once we start understanding the fundamental limitations of our computational processes for running probabilistic inference and computation. Yeah, that sounds super exciting. Thanks, Feras. That's making me so hopeful for the coming years, to hear you talk in that way. I'm super stoked about the world that you're depicting here. And actually, I think I still had so many questions for you, because, as I was saying, you're doing so many things.

But I think I've taken enough of your time. So let's call it a show. And before you go, though, I'm going to ask you the last two questions I ask every guest at the end of the show. If you had unlimited time and resources, which problem would you try to solve? Yeah, that's a very tough question. I should have prepared for that one better.

Yeah, I think one area which would be really worth solving, at least within the scope of Bayesian inference and probabilistic modeling, is using these technologies to unify people around data, around solid data-driven inferences, to have better discussions in empirical fields, right? So obviously politics is extremely divisive. People have all sorts of different interpretations based on their political views and based on their aesthetics and whatever, and all that's natural.

But one question I think about is: how can we have a shared language when we talk about a given topic, or the pros and cons of that topic, in terms of rigorous data-driven theses about why we have these different views, and try to disentangle the fundamental tensions and bring down the temperature, so that we can talk more about the data, and have good insights, or leverage insights from the data and use that to guide our decision-making, especially across the more

divisive areas like public policy, things of that nature. But I think part of the challenge, the reason why we don't do this, well, you know, from the political standpoint, it's much easier to not focus on what the data is saying, because that could be expedient, and it appeals to a broader number of people.

But at the same time, maybe we don't have the right language for how we might use data to think, you know, in a more principled way about some of the major challenges that we're facing. So, yeah, I think I'd like to get to a stage where we can focus more on, you know, principled discussions about hard problems that are really grounded in data.

And the way we would get those sorts of insights is by building good probabilistic models of the data and using them to explain, you know, explain to policymakers why they should or shouldn't do a certain thing, for example. So I think that's a very important problem to solve, because surprisingly many areas that are very high impact are not using real-world inference and data to drive their decision-making.

And that's quite shocking, whether that be in medicine, you know, we're using very archaic inference technologies in medicine and clinical trials, things of that nature. Even economists, right? Like, linear regression is still the workhorse in economics. We're using very primitive data analysis technologies. I'd like to see how we can use better data technologies, better types of inference, to think about these hard, challenging problems. Yeah, couldn't agree more.

And I'm coming from a political science background, so for sure these topics are always very interesting to me, quite dear to me. Even though in the last years, I have to say, I've become more and more pessimistic about these. And yeah, I completely agree with the problem and the issues you have laid out; as for the solutions, I am for now completely out of them, unfortunately. But yeah, I agree that something has to be done.

Because these kinds of political debates, which are completely outside the scientific consensus, just floor me. I'm like, but we've talked about that, you know, we've learned that. It's one of the things we know. I don't know why we're still arguing about it. Or if we don't know, why don't we try and find a way to, you know, find out, instead of just being like, I'm right because I think I'm right, and my position actually makes sense.

It's like one of the worst arguments: oh, well, it's common sense. Yeah, I think maybe there's some work we have to do in having people trust, you know, science and data-driven inference and data analysis more. That's done by being more transparent, by improving the ways in which they're being used, things of that nature, so that people trust these methods and they become the gold standard for talking about different political issues or social issues or economic issues. Yeah, for sure.

But at the same time, and that's definitely something I try to do at a very small scale with this podcast, it's: how do you communicate about science and try to educate the general public better? And I definitely think it's useful. At the same time, it's a hard task, because it's hard. If you want to find out the truth, it's often not intuitive. And so in a way you have to want it. It's like, eh, I know broccoli is better for my health long term, but I still prefer to eat a very, very fat snack.

I definitely prefer Snickers. And yet I know that eating lots of fruits and vegetables is way better for my health long term. And I feel it's a bit of a similar issue, where it's like, I'm pretty sure people know it's better long term to use these kinds of methods to find out about the truth, even if it's a political issue. Even more, I would say, if it's a political issue.

But it's just so easy right now, at least given how the different political incentives are, especially in Western democracies, the different incentives that come with the media structure and so on. It's actually way easier to not care about that and just lie and say what you think is true than to actually do the hard work. And I agree, it's very hard.

How do you make that hard work look not boring, but actually what you're supposed to do? That, I don't know for now. Yeah. That makes me think, I mean, I'm definitely always thinking about these things. Something that definitely helped me at a very small scale, my scale, because of course I'm always the scientist around the table. So of course, when these kinds of topics come up, I'm like, where does that come from, right? Like, why are you saying that?

Where, how do you know that's true, right? What's your level of confidence, and things like that. There is actually a very interesting framework which can teach you how to ask questions to actually really understand where people are coming from and how they developed their positions, more than trying to argue with them about their position.

And usually it ties in also with the literature about how to actually not debate, but talk with someone who has very entrenched political views. And it's called street epistemology. I don't know if you've heard of that. That is super interesting. And I will link to that in the show notes. So there is a very good YouTube channel by Anthony Magnabosco, who is one of the main people doing street epistemology. So I will link to that.

You can watch his videos, where he goes in the street, literally, and just talks about very, very hot topics with random people in the street. It can be politics; very often it's about supernatural beliefs, religious beliefs, things like that. These are really not light topics. But it's done through the framework of street epistemology, which is super helpful, I find.

And if you want a bigger overview of these topics, there is a very good, somewhat recent book called How Minds Change by David McRaney, who's also got a very good podcast called You Are Not So Smart. So definitely recommend those resources; I'll put them in the show notes. Awesome. Well, Feras, that was an unexpected end to the show. Thanks a lot. I think we've covered so many different topics. Well, actually, I still have a second question to ask you.

The second-to-last question I ask you: if you could have dinner with any great scientific mind, dead, alive, or fictional, who would it be? I think I will go with Hercule Poirot, Agatha Christie's famous detective. So I read a lot of Hercule Poirot, and I would ask him, because everything he does is based on inference, so I'd work with him to come up with a formal model of the inferences that he's making to solve very hard crimes.

That's the first time someone answers Hercule Poirot. But I'm not surprised as to the motivation. So I like it, I like it. I think I would do that with Sherlock Holmes also. Sherlock Holmes has a very Bayesian mind. I really love that. Yeah, for sure. Awesome. Well, thanks a lot, Feras. That was a blast. We've talked about so many things. I've learned a lot about GPs. Definitely going to try AutoGP.jl.

Thanks a lot for all the work you are doing on that, and all the different topics you are working on and were kind enough to come here and talk about. As usual, I will put resources and links to your website in the show notes for those who want to dig deeper, and feel free to add anything yourself for people. And on that note, thank you again for taking the time and being on this show. Thank you, Alex. I appreciate it. This has been another episode of Learning Bayesian Statistics.

Be sure to rate, review, and follow the show on your favorite podcatcher, and visit learnbayesstats.com for more resources about today's topics, as well as access to more episodes to help you reach a true Bayesian state of mind. That's learnbayesstats.com. Our theme music is Good Bayesian by Baba Brinkman, featuring MC Lars and Mega Ran. Check out his awesome work at bababrinkman.com. I'm your host, Alex Andorra. You can follow me on Twitter at alex_andorra, like the country.

You can support the show and unlock exclusive benefits by visiting patreon.com/learnbayesstats. Thank you so much for listening and for your support. You're truly a good Bayesian. Change your predictions after taking information in. And if you're thinking I'll be less than amazing, let's adjust those expectations. Let me show you how to be a good Bayesian. Change calculations after taking fresh data in. Those predictions that your brain is making, let's get them on a solid foundation.
