How does the world of statistical physics intertwine with machine learning, and what groundbreaking insights can this fusion bring to the field of artificial intelligence? In this episode, we'll delve into these intriguing questions with Marylou Gabrié. Having completed her doctorate in physics at École Normale Supérieure, Marylou ventured to New York City for a joint postdoctoral appointment at New York University's Center for Data Science and the Flatiron Institute's Center for Computational Mathematics. As you'll hear, her research is not just about theoretical exploration; it also extends to the practical adaptation of machine learning techniques in scientific contexts, particularly where data are scarce. And this conversation will traverse the landscape of Marylou's research, discussing her recent publications and her innovative approaches to machine learning challenges,
her inspirations, aspirations, and maybe even what she does when she's not decoding the complexities of machine learning algorithms. This is Learning Bayesian Statistics, episode 98, recorded November 23, 2023. Let me show you how to be a good lazy and change your predictions. Marylou Gabrié, welcome to Learning Bayesian Statistics. Thank you very much, Alex, for having me. Yes, thank you. And thank you to Virgile for putting us in contact.
This is a French connection network here. So thanks a lot, Virgile. Thanks a lot, Marylou, for taking the time. I'm probably going to say "Marylou" the English way because it flows better: if I say "Marylou" with the French pronunciation and then continue in English, I'm going to have the French accent, which nobody wants to hear. So let's start.
So I gave a bit of your background in the intro to this episode, Marylou, but can you define the work that you're doing nowadays and the topics that you are particularly interested in? I would define my work as being focused on developing methods, and more precisely developing methods that use and leverage all the progress in machine learning for scientific computing. I have a special focus within this realm,
which is to study high-dimensional probabilistic models, because they really come up everywhere. And I think they give us a very particular lens on our world. And so I would say I'm working broadly in this direction. Well, that sounds like a lot of fun. So I understand why Virgil put me in contact with you. And could you start by telling us about your journey?
actually into the field of statistical physics and how it led you to merge these interests with machine learning and what you're doing today. Absolutely. My background is actually in physics, so I studied physics. Among the topics in physics, I quickly became interested in statistical mechanics. I don't know if all listeners would be familiar with statistical mechanics, but I would define it broadly as the study of complex systems with many interacting components.
So it could be really anything. You could think of molecules, or networks of interacting agents, that have non-trivial interactions and non-trivial behaviors when put all together within one system. And I think it's, as I was saying, a really important viewpoint of the world today, to look at those big macroscopic systems that you can study probabilistically. And so I was quickly interested in this field that is statistical mechanics.
And at some point machine learning got into the picture. And the way it did is that I was looking for a PhD in 2015. And I had some friends who were, you know, students in computer science and kind of early comers to machine learning. And so I started to know that it existed. I started to know that actually deep neural networks were revolutionizing the field, that you could expect a program to, I don't know, give names to people in pictures.
And I thought, well, if this is possible, I really wanna know how it works. I really want, for this technology, not to sound like magic to me, and I want to know about it. And so this is how I started to become interested, and to find out that people knew how to make it work, but not how it worked, why it worked so well. And so this is how I, in the end, was put into contact with Florent Krzakala, who was my PhD advisor.
And I started to have this angle of trying to use the statistical mechanics framework to study deep neural networks, which are precisely those complex systems I was just mentioning, and which are so big that we are having trouble really making sense of what they are doing. Yeah, I mean, that must be quite... Indeed, it must be quite challenging. We could already dive into that. That sounds like fun. Do you want to talk a bit more about that project? Since then, I really shifted my angle.
I worked in this direction for, say, three, four years. Now, I'm actually going back to really the applications to real-world systems, let's say, using all the potentialities of deep learning. So it's like the same intersection, but looking at it from the other side. Now I'm really looking at applications and using machine learning as a tool, where before I was looking at machine learning as my object of study, using statistical mechanics.
So I'm keen on talking about what I'm doing now. Yeah. So basically you changed, now you're doing it the other way around, right? You're studying statistical physics with machine learning tools instead of doing the opposite. And so, yeah, what does that look like? What does that mean concretely? Maybe can you talk about an example from your own work so that listeners can get a better idea? Yeah, absolutely.
So, as I was saying, statistical mechanics is really about large systems that we study probabilistically. And here there's a tool, I mean, that would be one of the, I would say, most active directions of research in machine learning today, which is generative models. And they are very natural because they are ways of making probabilistic models that you can control, that you can produce samples from in one command, whereas you would need much more challenging algorithms if you wanted to do that for a general physical system. So we have those machines that we can leverage and that we can actually combine with our typical computational tools, such as Markov chain Monte Carlo algorithms, and that will allow us to speed up the algorithms.
Of course, it requires some adaptation compared to what people usually do in machine learning and how those generative models were developed, but it's possible and it's fascinating to try to make those adaptations. Hmm. So, yeah, that's interesting, because if I understand correctly, you're saying that one of the aspects of your job is to understand how to use MCMC methods to speed up these models?
Actually, it's the other way around, is how to use those models to speed up MCMC methods. Okay. Can you talk about that? That sounds like fun. Yeah, of course.
Say MCMC algorithms, so Markov chain Monte Carlos, are really the go-to algorithms when you are faced with a probabilistic model describing whichever system you care about. Say it might be a molecule, and this molecule has a bunch of atoms, and so you know that you can describe your system, I mean at least classically, at the level of giving the Cartesian coordinates of all the atoms in your system. And then you can describe the equilibrium properties of your system
by using the energy function of this molecule. So if you believe that you have an energy function for this molecule, then you believe that it's distributed as the exponential of minus beta times the energy. This is the Boltzmann distribution. And then, okay, you are left with your probabilistic model. And if you want to approach it, a priori you have no control over what this energy function is imposing as constraints. It may be very, very complicated. Well, the go-to algorithm is Markov chain Monte Carlo.
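To make this concrete, here is a minimal sketch of what's being described: random-walk Metropolis sampling from the Boltzmann distribution p(x) ∝ exp(−βE(x)). The double-well energy function and the inverse temperature value are illustrative choices for this example, not something from the episode.

```python
import numpy as np

def energy(x):
    # Illustrative double-well energy: two modes near x = -1 and x = +1,
    # separated by a barrier at x = 0.
    return (x**2 - 1.0)**2

def metropolis(n_steps, beta=1.0, step=0.5, x0=-1.0, seed=0):
    """Random-walk Metropolis targeting p(x) ∝ exp(-beta * energy(x))."""
    rng = np.random.default_rng(seed)
    x = x0
    samples = np.empty(n_steps)
    for t in range(n_steps):
        x_new = x + step * rng.normal()
        # Accept with probability min(1, exp(-beta * (E(x_new) - E(x)))).
        if rng.random() < np.exp(-beta * (energy(x_new) - energy(x))):
            x = x_new
        samples[t] = x
    return samples

samples = metropolis(20_000, beta=8.0)
# At large beta the barrier is rarely crossed: a chain started in the
# left well mostly stays there, illustrating the metastability problem
# discussed next.
print("fraction of samples in the right well:", np.mean(samples > 0))
```

Running this shows the "greedy local search" behavior described here: the chain explores one plausible region well, but crossing the low-probability barrier between modes is exponentially rare.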
And it's a go-to algorithm that is always going to "work", and here I'm putting quotes around this, because it's going to be a greedy algorithm that is going to be looking for plausible configurations next to other plausible configurations, and locally make a search on the configuration space, trying to visit it so that the visited configurations will be representative of the thermodynamics. Of course, it's not that easy.
And although you can make such a local search, sometimes it's really not enough to fully describe the probabilistic model, in particular how different regions of your configuration space are related to one another. So if I come back to my molecule example, it would be that I have two different, let's say, conformations of my molecule, two main templates that my molecule is going to look like.
And they may be divided by what we call an energy barrier, or in the language of probabilities, it's just low probability regions in between large probability regions. And in this case, local MCMCs are gonna fail. And this is where we believe that generative models could help us and, let's say, fill this gap to answer some very important questions. And how would that work then? Like, would you run a first model that would help you infer that and then use that in the MCMC algorithm?
Or like, yeah, what does that look like? I think your intuition is correct. So you cannot do it in one go. And, for example, the paper that I published, I think it was last year, in PNAS, that is called Adaptive Monte Carlo Augmented with Normalizing Flows, is precisely implementing something where you have feedback loops.
So the idea is that the local Monte Carlos that you can run within the different regions you have identified as being interesting will help you seed the training of a generative model that is going to target generating configurations in those different regions. Once you have this generative model, you can include it in your Markov chain strategy: you can use it as a proposal mechanism to propose new locations for your MCMC to jump to.
And so you're creating a Monte Carlo chain that is going to slowly converge towards the target distribution you're really after. And you're gonna do it by using the data you're producing to train a generative model that will help you produce better data, as it's incorporated within the MCMC kernel you are actually jumping with. So you have this feedback mechanism that makes things work.
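Here is a rough sketch of the structure being described: seed samples from local exploration of each mode are used to fit a proposal distribution, which then drives an independence Metropolis step. As an assumption for illustration, a two-component Gaussian mixture fitted by moment matching stands in for the normalizing flow (the real algorithm trains a flow and keeps retraining it on the chain's output); the double-well target is also just an example.

```python
import numpy as np

def energy(x):
    return (x**2 - 1.0)**2  # illustrative double well, modes near ±1

def log_target(x, beta=8.0):
    return -beta * energy(x)  # unnormalized log-density (Boltzmann form)

# Stand-in for the normalizing flow: a two-component Gaussian mixture
# fitted by moment matching to samples from each identified region.
def fit_proposal(samples):
    left, right = samples[samples < 0], samples[samples >= 0]
    return [(np.mean(s), max(np.std(s), 1e-2)) for s in (left, right)]

def proposal_sample(comps, rng):
    mu, sig = comps[rng.integers(len(comps))]
    return mu + sig * rng.normal()

def proposal_logpdf(comps, x):
    dens = [np.exp(-0.5 * ((x - mu) / sig) ** 2) / (sig * np.sqrt(2 * np.pi))
            for mu, sig in comps]
    return np.log(np.mean(dens))  # equal-weight mixture density

rng = np.random.default_rng(0)
# Step 1: short local exploration in each known region seeds the fit.
seed_samples = np.concatenate([
    -1.0 + 0.12 * rng.normal(size=500),
     1.0 + 0.12 * rng.normal(size=500),
])
comps = fit_proposal(seed_samples)

# Step 2: independence Metropolis using the fitted proposal as kernel.
x, chain, accepted = -1.0, [], 0
for _ in range(5000):
    x_new = proposal_sample(comps, rng)
    log_alpha = (log_target(x_new) - log_target(x)
                 + proposal_logpdf(comps, x) - proposal_logpdf(comps, x_new))
    if np.log(rng.random()) < log_alpha:
        x, accepted = x_new, accepted + 1
    chain.append(x)
chain = np.array(chain)
print("acceptance rate:", accepted / len(chain))
print("fraction in right mode:", np.mean(chain > 0))
```

In the full adaptive scheme, the chain's own samples would be fed back to retrain the generative model, closing the feedback loop; this sketch shows a single round of fit-then-sample, which already mixes freely between the two modes.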
And this idea of adaptivity really stems from the fact that in scientific computing, we are going to do machine learning with scarce data. We are not going to have all the data we wish we had to start with, but we are going to have these types of methods where we are doing things, as we call it, adaptively. So it's sampling, recording information, and sampling again. In a few words. Yeah. Yeah, yeah.
Yeah. So I mean, if I understand correctly, it's a way of going one step further than what HMC is already doing, where we're looking at the gradients and we're trying to adapt based on that. Now, basically, the idea is to find some way of getting even more information as to where the next sample should come from in the typical set, and then being able to navigate the typical set more efficiently? Yes. Yes, so let's say that it's an algorithm that is more ambitious than HMC.
Of course, there are caveats. But HMC is trying to follow a dynamic to try to travel towards interesting regions. But it has to be tuned quite finely in order to actually end up in the next interesting region, provided that it started from one. And so, to cross those energy barriers, here with machine learning, we would really be jumping over energy barriers. We would have models that pretty much only target the interesting regions and just don't care about what's in between.
And that really focuses the efforts where you believe it matters. However, there are cases in which those machine learning models will have trouble scaling, where HMC would be more robust. So there is of course always a trade-off in the algorithms that you are using: how efficient they can be per MCMC step and how general you can expect them to be. Hmm. I see. Yeah. So, and actually, yeah, that would be one of my questions: when do you think this kind of new algorithm
would be interesting to use instead of the classic HMC? Like, in which cases would you say people should give that a try instead of using the classic, robust HMC methods we have right now? So that's an excellent question. I think right now, so on paper, the algorithm we propose is really, really powerful, because it will allow you to jump throughout your space and so to decorrelate your MCMC configurations extremely fast.
However, for this to happen, you need the proposal that is made by your deep generative model, as a new location, I mean a new configuration in your MCMC chain, to be accepted. So in the end, you no longer have the fact that you are jumping locally and that your decorrelation comes from making lots of local jumps. Here you can decorrelate in one step, but you need to accept. So the acceptance will really be what you need to care about in running the algorithm.
And what is going to determine whether or not your acceptance is high is actually the agreement between your deep generative model and the target distribution you're after. And we have the traditional, you know, challenges here in making the generative model look exactly like the target we want. There are issues with scalability and there are issues with, I would say, constraints.
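In symbols, this link between acceptance and model quality is the standard Metropolis-Hastings rule for an independence proposal (a textbook identity, not something stated verbatim in the episode): with target π and flow density q, a proposed jump x → x' with x' drawn from q is accepted with probability

```latex
\alpha(x \to x') \;=\; \min\!\left(1,\;
  \frac{\pi(x')\, q(x)}{\pi(x)\, q(x')}\right),
\qquad x' \sim q .
```

When q closely matches π, the ratio stays near 1 and almost every proposed jump is accepted, which is exactly why the agreement between the generative model and the target governs the acceptance rate.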
So, let's say you're interested in Bayesian inference, another case where we can apply these kinds of algorithms, right? Because you have a posterior and you just want to sample from your posterior to make sense of it. I tell you, I know how to train normalizing flows, which are the specific type of generative models we are using here, in 10 or 100 dimensions.
So if you believe that your posterior is multimodal, that it will be hard for traditional algorithms to visit the entire landscape and equilibrate because there are some low density regions in between high density regions, go for it. If you actually are an astronomer and you want to marginalize over your initial conditions on a grid that represents the universe, and actually the posterior distribution you're interested in is on, you know, variables that are in millions of dimensions, I'm sorry.
We're not going to do it with you, and you should actually use something that is more general, something that will use a local search, but that is actually going to, you know, be imperfect, right? Because it's going to be very, very hard also for this algorithm to work. But the magic of machine learning will not scale yet to this type of dimension. Yeah, I see. And is that an avenue you're actively researching, basically how to scale these algorithms to bigger problems?
Yeah, of course. Of course we can always try to do better. So, I mean, as far as I'm concerned, I'm also very interested in sampling physical systems. And in physical systems, there is a lot of, you know, prior information that you have on the system. You have symmetries, you have, I don't know, physical rules that you know the system has to fulfill. Or maybe some, I don't know, multi-scale
property of the probability distribution; you know that there are some self-similarities. You have information you can try to exploit in two ways. You have this MCMC coupled with the generative models, so either in the way you make proposals, where you can try to symmetrize them, you can try to exploit the symmetry by any means, or you can also directly put it in the generative model. So those are things that really are crucial.
And we understand very well nowadays that it's naive to think you will learn it all. You should really use as much information on your system as you may, as you can. And after that, you can go one step further with machine learning. But in non-trivial systems, it would be, I mean, self-deceiving to believe that you could just learn everything. Yeah. I mean, I completely resonate with that.
It's definitely something we will always tell students or clients: don't just, you know, throw everything at the model that you can and just pray that the model works like that. Instead, you should probably use a generative perspective to try and find out what the best way of thinking about the problem is, what would be the good enough, simple enough model that you can come up with, and then try to run that.
Yeah, so definitely I think that resonates with a lot of the audience: think generatively. And from what I understand from what you said, it's also about trying to put as much knowledge and information as you have into your generative model. The deep neural network, the normalizing flow, is here to help, but it's not going to be a magical solution to a suboptimally specified model. Yes, yes. Of course, in all those problems, what's hidden behind is the curse of dimensionality.
If we are trying to learn something in very high dimension, it could be arbitrarily hard. It could be that you cannot learn something in high dimension just because you would need to observe all the locations in this high dimension to get the information. Of course, this is in general not the case, because what we are trying to learn has some structure, some underlying structure that is actually described by fewer dimensions. And you actually need fewer observations to learn it.
But the question is, how do you find those structures, and how do you put them in? Therefore, we need to take into account as much of the knowledge we have on the system as possible to make this learning as efficient as possible. Yeah, yeah, yeah. Now, I mean, that's super interesting. And that's your paper, Adaptive Monte Carlo Augmented with Normalizing Flows, right? Yes, this is the paper where we did this generically.
And I don't have a paper out yet where we are trying to really put the structure in the generative models, but that's the direction I'm actively working in. Okay, yeah. I mean, so for sure, we'll put that paper I just cited in the show notes for people who want to dig deeper. And also, if by the time this episode is out, you have the paper or a preprint, feel free to add that to the show notes or just tell me and I'll add that to the show notes. That sounds really interesting for people to read.
And so I'm curious, you know, about this idea of normalizing flows, deep neural networks, to help MCMC sample faster, converge faster to the typical set. What was the main objective of doing that? I'm curious, why did you even start thinking and working on that? So yes, I think for me, the answer is really this question of multimodality. So the fact that you may be interested in probability distributions for which it's very hard to connect the different interesting regions.
In statistical mechanics, it's something that we actually call metastability. So I don't know if it's a word you've already heard, but where some communities talk about multimodality, we talk about metastability. And metastabilities are at the heart of many interesting phenomena in physics, be it phase transitions. And therefore, it's something very challenging in the computations, but at the same time, very crucial that we have an understanding of.
So for us, it felt like there was this big opportunity with those probabilistic models that were so malleable, that were so, I mean, of course, hard to train, but then they give you so much. They give you an exact value for the density that they encode, plus the possibility of sampling from them very easily, getting a bunch of IID samples in just one run through a neural network.
So for us, there was really this opportunity of studying multimodal distributions, in particular metastable systems from statistical mechanics, with those tools. Yeah. Okay. So in theory, these normalizing flows are especially helpful to handle multimodal posteriors. I didn't get that at first, so that's interesting.
Yep. That's really what they're going to offer you: the possibility to make large jumps, actually to make jumps within your Markov chain that can go from one location of high density to another one, just in one step. So this is what you are really interested in. Well, first of all, it's in one step, so you're going far in one step.
And second of all, regardless of how low the density is between them, because if you were to run some other type of local MCMC, you would, in a sense, need to find a path between the two modes in order to visit both of them. In our case, it's not true.
You're just completely jumping out of the blue thanks to your normalizing flow, which is trying to mimic your target distribution, and therefore has developed mass everywhere that you believe matters, and from which you can produce an IID sample anywhere on its support very easily. I see, yeah. And I'm guessing you did some benchmarks for the paper?
So I think that's actually a very interesting question you're asking, because I feel benchmarks are extremely difficult, both in MCMC and in deep learning. So, I mean, you can make benchmarks saying, okay, I changed the architecture and I see that I'm getting something different. But otherwise, I think it's one of the big challenges that we have today.
So if I tell you, okay, with my algorithm, I can write an MCMC that is going to mix between the different modes, between the different metastable states, that's something that I don't know how to do by any other means. So the benchmark is: actually, you won. There is nothing to be compared with, so that's fine.
But if I need to compare on other cases where actually I can find those algorithms that will work, but I know that they are going to probably take more iterations, then I still need to factor in a lot of things in my true honest benchmark. I need to factor in the fact that I run a lot of experiments to choose the architecture of my normalizing flow. I run a lot of experiments to choose the hyperparameters of my training and so on and so forth.
And I don't see how we can make those honest benchmarks nowadays. So I can make one, but I don't think I would think very highly that it's, I mean, you know, really revealing some profound truth about which solution is really working. The only way of making an honest benchmark would be to take different teams, give them problems, and lock them in a room and see who comes out first with the solution. But I mean, how can we do that?
Well, we can call on listeners who are interested to do the experiments to contact us. That would be the first thing. But yeah, that's actually a very good point. And in a way, that's a bit frustrating, right? Because then it means, at least experimentally, it's hard to differentiate between the efficiency of the different algorithms. So I'm guessing the claims that you make about this new algorithm being more efficient for multimodality are based on the theoretical underpinning of the algorithm?
No, I mean, it's just based on the fact that I don't know of any other algorithm, which under the same premises, which can do that. So, I mean, it's an easy way out of making any benchmark, but also a powerful one because I really don't know who to compare to. But indeed, I think then it's... As far as I'm concerned, I'm mostly interested in developing methodologies. I mean, that's just what I like to do.
But of course, what's important is that those methods are going to work and be useful to some communities that really have research questions they want to answer. I mean, research or not, actually; they could be engineering questions, decisions to be taken that require doing an MCMC. And I think the true test of whether or not the algorithm is useful is going to be the test of time. Are people adopting the algorithms?
Are they seeing that this is really something that they can use and that would make their inference work where they could not find another method that was as efficient? And in this direction, there is a close collaborator, Kaze Wong, who is working at the Flatiron Institute and with whom we developed a package that is called FlowMC, which is written in JAX and implements these algorithms. And the idea was really to try to write a package that was as user-friendly as possible.
So of course, we only have the time we have to take care of it, and the experience we have with the available software, but we really try hard. And at least in this community of people studying gravitational waves, it seems that people are really starting to use this in their research. And so I'm excited, and I think it is useful. But it's not the proper benchmark you would dream of. Yeah, you just stole one of my questions.
Basically, I was exactly going to ask you, but then how can people try these? Is there a package somewhere? So yeah, perfect. That's called FlowMC, you told me. Yes, it's called FlowMC. You can pip install FlowMC, and you will have it. If you are allergic to JAX... Right, I have it here. Yeah, there is a Read the Docs. So I'll put that in the show notes for sure. Yes, we even have documentation. That's how far you go when you are committed to having something that is used and useful.
So I mean, of course, we are also open to both comments and contributions. So just write to us if you're interested. Yeah, for sure. Yeah, folks, if you are interested in contributing, if you see any bugs, make sure to open some issues on the GitHub repo or, even better, contribute pull requests. I'm sure Marylou and the co-authors will be very happy about that. Yes, you know, typos in the documentation, all of this. Yeah, exactly.
That's what I tell everyone who wants to start doing some open source work: start with the smallest PRs. You don't have to write a new algorithm; already fixing typos, making the documentation look better, and stuff like that, that's extremely valuable, and it will be appreciated. So for sure, do that, folks. Do not be shy with that kind of stuff. So yeah, I already put the paper you have out on arXiv, Adaptive Monte Carlo Augmented with Normalizing Flows, and FlowMC in the show notes.
And yeah, to get back to what you were saying, basically, I think, as more of a practitioner than a person who develops the algorithms, I would say the reasons I would, you know, adopt that kind of new algorithm would be that, well, I know, okay, that algorithm is specialized, especially for handling multimodal posteriors. So then, if I have a problem like that, I'll be like, oh, okay, yeah, I can use that. And then also ease of adoption.
So is there an open source package, and in which languages, that I can just, you know, use? What kind of trade-off basically do I have to make? Is that something that's easy to adopt? Is that something that has really a lot of barriers to adoption, but at the same time really seems to be solving my problem? You know what I'm saying? It's like, indeed, it's not only the technical and theoretical aspects of the method, but also how easy it is to adopt in your existing workflows.
Yes. And for this, I guess, I mean, the feedback is extremely valuable, because when you know the methods, it's hard to exactly locate where people will not understand what you meant. And so feedback is really welcome. No, for sure.
And already I find it absolutely incredible that now almost all new algorithms, at least the ones I talk about on the podcast and that I see in the community, in the PyMC community, almost all of them now, when they come up with a paper, they come out with an open source package that's usually installable in the Python ecosystem. Which is really incredible.
I remember that when I started on these a few years ago, it was really not the norm and much more the exception, and now the accompanying open source package is almost part of the paper, which is really good, because way more people are going to use the package than read the paper. So this is absolutely a fantastic evolution.
And thank you, in the name of us all, for having taken the time to develop the package, clean up the code, put that on PyPI and make the documentation, because that's where the academic incentives are a bit misaligned with what I think they should be. Because unfortunately, it literally takes time for you to do that. And it's not very much appreciated by the academic community, right? It's just like, you have to do it, but they don't really care.
We care as practitioners, but the academic world doesn't really. And what counts is the paper. So for now, unfortunately, it's really just time that you take out of your paper-writing time. So I'm sure everybody appreciates it. Yes, but I don't know, I see true value in it. And I think, although it's maybe not as rewarded as it should be, many of us see value in doing it. So you're very welcome. Yeah, yeah. No, for sure. Lots of value in it.
Just saying that value should be more recognized. Just a random question, but something I'm always curious about; I think I know the answer but I still want to ask. Can you sample discrete parameters with these algorithms? Because that's one of the grails of the field right now: how do you sample discrete parameters? So, okay, what I've implemented and tested is all on continuous spaces.
But what I need for this algorithm to work is a generative model that I can sample from easily, IID. I mean, not one where I have to run a Monte Carlo to sample from it, but one where I can, in one Python command, or whichever language you want, get an IID sample, and for which I can write down the likelihood of this sample. Because a lot of generative models actually don't have tractable likelihoods.
So if you think, I don't know, of generative adversarial networks or variational autoencoders, for people who might be familiar with those very, very common generative models, they don't have this property. You can generate samples easily, but you cannot write down with which probability density you've generated this sample.
This is really what we need in order to use this generative model inside a Markov chain, inside an algorithm that we know is going to converge towards the target distribution. So normalizing flows are playing this role for us with continuous variables. They give us easy sampling and easy evaluation of the likelihood. But you also have equivalents for discrete distributions.
And if you want a generative model that has those two properties on discrete distributions, you should turn to autoregressive models. So I don't know if you've heard about them, but the idea is just that they use a factorization of the probability distribution into conditional distributions.
And that's something that in theory has full expressivity: any distribution can be written as a factorized distribution where you condition progressively on the degrees of freedom that you have already sampled. And you can rewrite the algorithm, training an autoregressive model in place of a normalizing flow. So, honest answer, I haven't tried, but it can be done. Well, it can be done.
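The two properties being described, one-shot IID sampling plus an exact tractable likelihood, can be sketched with a tiny hand-specified autoregressive model over binary spins, p(x) = Π_i p(x_i | x_1..x_{i-1}). The linear-logit conditionals and the coupling values here are made-up for illustration; a real model would learn them.

```python
import numpy as np

def sample_and_logprob(weights, biases, rng):
    """Sample binary spins x in {0,1}^n autoregressively:
    p(x) = prod_i p(x_i | x_1..x_{i-1}), each conditional a Bernoulli
    whose logit depends linearly on the spins already sampled.
    Returns the sample and its exact log-probability."""
    n = len(biases)
    x = np.zeros(n, dtype=int)
    logp = 0.0
    for i in range(n):
        logit = biases[i] + weights[i, :i] @ x[:i]
        p_i = 1.0 / (1.0 + np.exp(-logit))      # p(x_i = 1 | earlier spins)
        x[i] = rng.random() < p_i               # one-shot ancestral sampling
        logp += np.log(p_i if x[i] else 1.0 - p_i)
    return x, logp

rng = np.random.default_rng(42)
n = 5
# Illustrative positive couplings between each spin and the earlier ones
# (strictly lower-triangular so the factorization stays autoregressive).
weights = 2.0 * np.tril(np.ones((n, n)), -1)
biases = np.zeros(n)
x, logp = sample_and_logprob(weights, biases, rng)
print("sample:", x, "log-probability:", logp)
```

Both ingredients needed for the flow-augmented MCMC scheme are visible here: the sample comes out in one pass, and the exact log-likelihood of that sample comes with it, so it can plug into a Metropolis-Hastings acceptance ratio just like a normalizing flow does in the continuous case.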
And now that I'm thinking about it, people have done it, because in statistical mechanics, there are a lot of systems that we like, a lot of our toy systems, that are binary. That's, for example, the Ising model, which is a model of spins that are just binary variables. And I know of at least one paper where they are doing something of this sort.
So they are making jumps; actually they are not just trying to refresh full configurations, they are doing both, refreshing full configurations and partial configurations. And they are doing something that, in essence, is exactly this algorithm, but with discrete variables. So I'll happily add the reference to this paper, which is, I think, by the group of Giuseppe Carleo from EPFL.
And OK, I don't think they train it in exactly the same way, so it's not exactly the same algorithm, but things around this have been tested. OK, well, it sounds like fun, for sure. Definitely something I'm sure lots of people would like to test. So folks, if you have some discrete parameters somewhere in your models, maybe you'll be interested by normalizing flows. The flowMC package is in the show notes. Feel free to try it out.
Another thing I'm curious about is how do you train the neural network, actually? And how much of a bottleneck is it on the sampling time, if any? Yes. So it will definitely depend on the space. No, let me rephrase. The thing is, whether or not it's going to be worth it to train a neural network in order to help you sample depends on how difficult it is for you to sample with the more traditional MCMCs that you have on hand.
So again, if you have a multimodal distribution, it's very likely that your traditional MCMC algorithms are just not going to cut it. And so then, I mean, if you really care about sampling this posterior distribution, or this distribution of configurations of a physical system, then you will be willing to pay the price for this sampling.
So instead of, say, having to use a local sampler that will take you billions of iterations to see transitions between the modes, you can train a normalizing flow, or an autoregressive model if you're discrete, and then have those jumps happening every other step. Then it's more than clear that it's worth doing. OK, yeah, so the answer is: it depends quite a lot. Of course, of course. Yeah, yeah. And I guess, how does it scale with the quantity of parameters and quantity of data?
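The jump mechanism being described can be sketched in a few lines (an editorial illustration, not the paper's implementation): an independence Metropolis-Hastings step whose proposal stands in for a trained flow. Here a broad Gaussian plays the flow's role, since like a flow it can be both sampled and evaluated exactly; the acceptance ratio is what guarantees convergence to the bimodal target that a local sampler would struggle with.

```python
import numpy as np

rng = np.random.default_rng(1)

# Bimodal target (unnormalized log-density): two well-separated Gaussians.
def log_target(x):
    return np.logaddexp(-0.5 * (x - 4.0) ** 2, -0.5 * (x + 4.0) ** 2)

# Stand-in for a trained normalizing flow: a distribution we can both
# sample from and evaluate exactly (here simply a broad Gaussian).
SCALE = 5.0
def log_flow(x):
    return -0.5 * (x / SCALE) ** 2 - np.log(SCALE * np.sqrt(2.0 * np.pi))

x = 4.0                                  # start stuck in one mode
samples = []
for _ in range(20_000):
    x_prop = SCALE * rng.standard_normal()        # global "jump" proposal
    # Independence Metropolis-Hastings acceptance:
    # accept with probability min(1, p(x') q(x) / (p(x) q(x'))).
    log_ratio = (log_target(x_prop) - log_target(x)
                 + log_flow(x) - log_flow(x_prop))
    if np.log(rng.random()) < log_ratio:
        x = x_prop
    samples.append(x)
samples = np.array(samples)
```

Despite starting in the mode at +4, the chain visits both modes, because the global proposal does not need to diffuse through the low-probability region between them.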
So quantity of parameters — it's really this dimension I was already discussing a bit, telling you that there is a cap on what you can really expect these methods to work on. I would say that if the quantity of parameters is something like tens or hundreds, then things are going to work well, more or less out of the box. But if it's larger than this, you will likely run into trouble.
And then the number of data points is actually something I'm less familiar with, because I'm less from the Bayesian community than the stat-mech community to start with. So my distributions don't have data embedded in them, in a sense, most of the time. But for sure, what people argue — why it's a really good idea to use generative models such as normalizing flows to sample in the Bayesian context — is the fact that you have an amortization going on. And what do I mean by that?
I mean that you're learning a model. Once it's learned, it's going to be easy to adjust it if things are changing a little. And with little adjustments, you're going to be able to sample still a very complicated distribution. So say you have data that is arriving online, and you keep on having new samples to be added to your posterior distribution.
then it's very easy to just adjust the normalizing flow with a few training iterations to get back to the new posterior you actually have now, given that you have this amount of data. So this is what some people call amortization: the fact that you can really encapsulate in your model all the knowledge you have so far, and then just adjust it a bit, and don't have to start from scratch, as you would have to in other Monte Carlo methods.
Yeah. Yeah, so what I'm guessing is that maybe the tuning time is a bit longer than a classic HMC, but then once you're out of the tuning phase, the sampling is going to be way faster. Yes, I think that's a correct way of putting it. And otherwise, for the dimensionality that the algorithm is comfortable with —
In general, the running times of the model — have you noticed them being close to a classic HMC, or is that something you haven't tried yet? I don't think I can honestly answer this question. I think it will depend, because it will also depend on how easily your HMC reaches all the regions you actually care about. So, I mean, probably there are some distributions that are very easy for HMC to cover, where it wouldn't be worth it to train the model.
But then there are plenty of cases where things are the other way around. Yeah, yeah, yeah. Yeah, I can guess. That's always something that's really fascinating in this algorithm world: how dependent everything is on the model and the use case — really dependent on the model and the data. So on this project, on this algorithm, what are the next steps for you? What would you like to develop next on this algorithm precisely?
Yes, so as I was saying, one of my main questions is how to scale this algorithm. We kind of wrote it in an all-purpose fashion. And all-purpose is nice, but all-purpose does not scale. So that's really what I'm focusing on: trying to understand how we can use structures we can know, or we can learn, from the system — how to exploit them and put them in — in order to be able to tackle more and more complex systems with more degrees of freedom.
So more parameters than what we are currently doing. So there's this. And of course, I'm also very interested in having some collaborations with people that care about an actual problem for which this method is actually solving something for them. That's really what gives you the idea of what's next to be developed — what are the next methodologies that will be useful to people? Can they already solve their problem? Do they need something more from you?
And those are the two things I'm having a look at. Yeah. Well, it definitely sounds like fun. And I hope you'll be able to work on that and come up with some new, amazing, exciting papers on this. I'll be happy to look at that. And so that's it — it was a great deep dive on this project, and thank you for indulging my questions, Marie-Lou. Now, if we want to de-zoom a bit and talk about other things you do: you also mentioned that you're interested in machine learning in the context of scarce data.
So I'm curious about what you're doing on this, if you could elaborate a bit. Yes, so I guess what I mean by scarce data is precisely that when we are using machine learning in scientific computing, usually what we are doing is exploiting the great tool that deep neural networks are to play the role of a surrogate model somewhere in our scientific computation. But most of the time, this is without data a priori. We know that there is a function we want to approximate somewhere.
But in order to have data, either we have to pay the price of costly experiments, costly observations, or we have to pay the price of costly numerics. A very famous example of application of machine learning to scientific computing is molecular dynamics with quantum precision — this is what people call density functional theory. So if you want to
observe the dynamics of a molecule with the accuracy of what's going on really at the level of quantum mechanics, then you have to make very, very costly calls to a function that predicts the energy predicted by quantum mechanics and the forces predicted by quantum mechanics. So people have seen here an opportunity to use deep neural nets to just regress the value of this quantum potential at the different locations that you're going to visit.
And the idea is that you are creating your own data. You are deciding when you are going to pay the price of doing the full numerical computation and then obtain a training point: given Cartesian coordinates, what is the value of the energy here. And then — contrary to what you're doing traditionally in machine learning, where you believe that you have huge data sets that encapsulate a rule, and you're going to try to exploit them as best you can —
here, you have the choice of where you create your data. And so you, of course, have to be as smart as possible in order to create as few training points as possible. And this is the idea of working with scarce data that has to be infused into the usage of machine learning in scientific computing.
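This "create your own data, as frugally as possible" loop can be sketched as follows (an editorial illustration with made-up ingredients, not the method from the episode): an expensive oracle stands in for the costly computation, and new training points are bought only where an ensemble of cheap surrogates disagrees most.

```python
import numpy as np

def expensive_oracle(x):
    """Stand-in for a costly simulation or experiment."""
    return np.sin(3 * x) + 0.5 * x

grid = np.linspace(-1, 1, 201)           # candidate query locations
candidates = list(range(len(grid)))

# Seed set: a few points we are willing to pay for up front.
X = list(np.linspace(-1, 1, 8))
y = [expensive_oracle(v) for v in X]

for _ in range(6):
    # Cheap surrogate ensemble: polynomial fits of different degrees.
    preds = [np.polyval(np.polyfit(X, y, d), grid) for d in (3, 5, 7)]
    disagreement = np.std(preds, axis=0)  # high where the fits conflict
    best = max(candidates, key=lambda i: disagreement[i])
    candidates.remove(best)
    X.append(grid[best])                  # pay for one more data point...
    y.append(expensive_oracle(grid[best]))  # ...exactly where it helps most

surrogate = np.polyval(np.polyfit(X, y, 7), grid)
max_error = np.max(np.abs(surrogate - expensive_oracle(grid)))
```

The design choice is the acquisition rule: any measure of surrogate uncertainty can decide where the next expensive call goes, so the training set stays as small as possible.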
My example of application is just what we have discussed, where we want to learn a deep generative model, whereas when we start, we just have our target distribution as an objective, but we don't have any samples from it — which would be the traditional data that people would use in generative modeling to train a generative model. So if you want, we are playing this adaptive game I was already hinting at,
where we are creating data that is not exactly the data we want, but that we believe is informative of the data we want, to train the generative model that is in turn going to help us converge the MCMC — and, at the same time as you are training your model, generate the data you would have needed to train your model. Yeah, that is really cool. And of course I asked about that, because scarce data is something that's extremely common in the Bayesian world.
That's usually where Bayesian statistics proves helpful and useful, because when you don't have a lot of data, you need more structure and more priors if you want to say anything about your phenomenon of interest. So that's really cool that you're working on that. I love that. And from a bit broader perspective — you know MCMC really well, you work on it a lot — I'm curious where you think MCMC is heading in the next few years.
And if you see its relevance waning in some way. Well, I don't think MCMC can go out of fashion, in a sense, because it's absolutely ubiquitous. Practical use cases are everywhere. If you have a large probabilistic model, usually it's given to you by the nature of the problem you want to study. And if you cannot choose anything about putting in the right properties, you're just going to be, you know, left with something that you don't know how to approach except by MCMC.
So it's absolutely ubiquitous as an algorithm for probabilistic inference. And I would also say that one of the things that are going to, you know, keep MCMC going for a long time is how much it's a cherished object of study for researchers from different communities, because, I mean...
You can see people really from statistics that are kind of the prime researchers on, okay, how should you make a Monte Carlo method that has the best convergence properties, the best speed of convergence, and so on and so forth. But you can also see that the fields where those algorithms are used a lot, be it statistical mechanics, be it Bayesian inference, also have full communities that are working on developing MCMCs.
And so I think it's really a matter of them being an object of curiosity, and intriguing to a lot of people. And therefore it's something that for now is still very relevant and really unsolved. I mean, something that I love about MCMC is that when you look at it first, you say, yeah, that's simple, you know? Yeah. Yes. But then you start thinking about it. Then you, I mean, realize how subtle all the properties of those algorithms are.
And you're telling yourself: I cannot believe it's so hard to actually sample from distributions that are not that complicated — when you're a naive newcomer. And so, yeah, I mean, for now, I think they are still here and in place. And if I could even comment a bit more on exactly the context of my research, where it could seemingly be the case that I'm trying to replace MCMC with machine learning: I would warn the listeners that that's not at all what we are concluding.
I mean, that's not at all the direction we are going in. It's really a case where we need both: MCMC can benefit from learning, but learning without MCMC is never going to give you something that you have enough guarantees on, something that you can really trust for sure. So I think there is a really nice combination of MCMC and learning here, and they're just going to nurture each other, not replace one another. Yeah, yeah, for sure.
And I really love, yeah, these projects of trying to make MCMC more informed: instead of having almost random draws with Metropolis, making that more informed with the gradients, with HMC, and then with normalizing flows, which try to squeeze a bit more information out of the structure that you have to make the sampling go faster. I found that one super useful.
And also, yeah, that's a very, very fascinating part of the research. And this is also part of a lot of the research initiatives that you have focused on, right? Basically what we could describe as machine-learning-assisted scientific computing. So do you have other examples to share with us of how machine learning is helping traditional scientific computing methods?
Yes. So, for example, I was already giving the example of the learning, the regression, of the potentials — of molecular force fields — for people that are studying molecules. But we are seeing a lot of other things going on. So there are people that are trying to even use machine learning as a black box in order to, how should I say, make classifications between things they care about. So for example, you have samples that come from a model.
But you're not sure if they come from this model or another one. You're not sure if they are above or below a critical temperature, if they belong to the same phase. So you can really play this game of creating an artificial data set where you know the answer, train a classifier, and then use your black box to tell you, when you see a new configuration, which type of configuration it is. And it's really
given to you by deep learning, because you would have no idea why the neural net is deciding that it's actually this one or that one. You don't have any other statistic that you could gather that would tell you what the answer is and why. But it's kind of like opening this new conceptual door: sometimes there are things that are predictable — I mean, you can check that, OK, on the data where you know the answer, the machine is extremely efficient —
but then you don't know why things are happening this way. I mean, there's this, but there are plenty of other directions. So there are people that are, for example, using neural networks to try to discover a model. And here, a model would be what people call partial differential equations, so PDEs. I don't know if you've heard about those physics-informed neural networks. But these are neural networks that people are training such that they are solutions of a PDE.
So instead of actually having training data, what you do is use the properties of deep neural nets, which are that they are differentiable with respect to their parameters, but also with respect to their inputs. So for example, you have a function f, and you know that the Laplacian of f is supposed to be equal to the derivative in time of f. Well, you can write a mean squared loss on the fact that the Laplacian of your neural network has to be close to its derivative in time.
And then, given boundary conditions — so maybe an initial condition in time and boundary conditions in space — you can ask a neural net to predict the solution of the PDE. And even better, you can give your learning mechanism a library of terms that would be possible candidates for being part of the PDE, and you can let the network tell you which terms in the library actually seem to be present in the data you are observing.
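The loss construction described here — "the Laplacian of your network has to be close to its derivative in time" — can be checked numerically with a tiny sketch (an editorial illustration: a closed-form trial function stands in for the neural network, and finite differences stand in for automatic differentiation). For the 1D heat equation, the mean squared PDE residual is essentially zero for a true solution and large for anything else.

```python
import numpy as np

def pde_residual(f, x, t, h=1e-3):
    """Residual of the heat equation df/dt = d2f/dx2, via central
    finite differences (a real PINN would use autodiff instead)."""
    f_t = (f(x, t + h) - f(x, t - h)) / (2 * h)
    f_xx = (f(x + h, t) - 2 * f(x, t) + f(x - h, t)) / h**2
    return f_t - f_xx

x = np.linspace(0.1, 3.0, 50)
t = np.linspace(0.0, 1.0, 50)
X, T = np.meshgrid(x, t)

# f(x, t) = exp(-t) sin(x) solves the heat equation exactly,
# so its "physics loss" (mean squared residual) is essentially zero.
exact = lambda x, t: np.exp(-t) * np.sin(x)
loss_exact = np.mean(pde_residual(exact, X, T) ** 2)

# A function that does NOT solve the PDE pays a large physics loss.
wrong = lambda x, t: np.sin(x + t)
loss_wrong = np.mean(pde_residual(wrong, X, T) ** 2)
```

Training a PINN amounts to minimizing exactly this kind of residual loss over the network's parameters, plus penalty terms for the initial and boundary conditions.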
So, I mean, there are all kinds of inventive ways that researchers are now using the fact that deep neural nets are differentiable, smooth, can generalize easily, and, yes, are universal approximators. I mean, seemingly you can use neural nets to represent any kind of function, and use that inside their computational problems to try to, I don't know, answer all kinds of scientific questions. So it's, I believe, pretty exciting. Yeah, yeah, that is super fun.
I love how, you know, these come together to help on really hard sampling problems — sampling ODEs or PDEs is just extremely hard. So yeah, using that. Maybe one day we'll also get something for GPs. I know for Gaussian processes a lot of the effort is on decomposing them and finding some useful algebraic decompositions — like the Hilbert space Gaussian processes that Bill Engels especially has added to the PyMC API, or eigenvalue decompositions, stuff like that.
But I'd be curious to see if there are also some initiatives trying to help the convergence of Gaussian processes using, probably, deep neural networks, because there is a mathematical connection between neural networks and GPs. I mean, everything is a GP in the end, it seems. So yeah, using a neural network to facilitate the sampling of a Gaussian process would be super fun. So, I have so many more questions, but I want to be mindful of your time — we've already been recording for some time.
So I'll try to make my questions more compact. But something I wanted to ask you: you actually teach a course at École Polytechnique in France called Emerging Topics in Machine Learning. So I'm curious to hear what are some of the emerging topics that excite you the most, and how do you approach teaching them? So this class is actually the nice class where we have a wild card to just talk about whatever we want.
So as far as I'm concerned, I'm really teaching about the last point that we discussed, which is how can we hope to use the technology of machine learning to assist scientific computing. And I have colleagues that are jointly teaching this class with me that are, for example, teaching about optimal transport or about private and federated learning. So it can be different topics.
But we all have the same approach to it, which is to introduce to the students the main ideas quite briefly, and then to give them the opportunity to read papers that we believe are important, or at least really illustrative of those ideas and of the direction in which the research is going — and to read these papers, of course, critically. So the idea is that we want to make sure that they are understood. We also want them to implement the methods.
And once you implement the methods, you realize everything that is sometimes swept under the rug in the paper. So where is it really difficult? Where is the method really making a difference? And so on and so forth. So that's our approach to it. Yeah, that must be a very fun course. At which level do you teach it? So our students are in their third year at École Polytechnique, so that would be equivalent to the first year of a graduate program.
Yeah. And actually, looking forward, what do you think are the most promising areas of research in what you do — basically, the interaction of machine learning and statistical physics? Well, I think something that actually has been, and will continue being, a very, very fruitful field between statistical mechanics and machine learning is generative models.
So you've probably heard of diffusion models. They are a new kind of generative model that relies on learning how to reverse a diffusion process — a diffusion process that is noising the data — and, once you've learned how to reverse it, it will allow you to transform noise into data. It's something that is really close to statistical mechanics, because the diffusion really comes from studying the Brownian particles that are all around us. And this is where this mathematics comes from.
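The forward (noising) half of this process fits in a few lines (an editorial sketch; the noise schedule below is an arbitrary illustrative choice): each step shrinks the data slightly toward zero and adds Gaussian noise, so a structured distribution is gradually turned into a standard Gaussian. What a diffusion model actually learns is how to run these steps in reverse.

```python
import numpy as np

rng = np.random.default_rng(3)

# "Data": a sharply bimodal distribution (values near -1 and +1).
x = rng.choice([-1.0, 1.0], size=10_000) + 0.05 * rng.normal(size=10_000)

betas = np.linspace(1e-4, 0.2, 100)   # noise schedule (assumed, illustrative)
for beta in betas:
    # One forward noising step: shrink toward 0, inject Gaussian noise.
    # This variance-preserving form keeps the marginal variance near 1.
    x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.normal(size=x.size)

# After the full schedule, x is close to a standard Gaussian: the two
# modes have been washed out, which is exactly what the learned reverse
# process must undo to turn noise back into data.
```

The connection to statistical mechanics is that this recursion is a discretized Ornstein-Uhlenbeck process, the same stochastic dynamics used to describe Brownian particles relaxing to equilibrium.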
And this is still an object of study in the field of statistical mechanics, and it has served a lot of machine learning models. I could also cite Boltzmann machines — I mean, they even bear the name of the father of statistical mechanics, Boltzmann. And here again, it's really inspiration from the models studied by physicists that gave the first forms of the models that were used by machine learners to do density estimation.
So there has really been this cross-fertilization for, I guess, the last 50 years, since the field of machine learning emerged in these communities. And I'm hoping that my work, and all the groups that are working in this direction, are also going to demonstrate the other way around: that generative models can also help a lot in statistical mechanics. So that's definitely what I am looking forward to.
Yeah. Yeah, I love that, and I understand why you're talking about that, especially with the whole conversation we've had — your answer is not surprising to me. Actually, something also, I mean, even broader than that — I'm guessing you already care a lot about these questions from what I get, but if you could choose the questions you'd like to see answered before you die, what would they be? That's obviously a very vast question.
If I stick a bit to what we've discussed about the sampling problems — where I think they are hard and why they are so intriguing — I think something I'm very keen on seeing some progress around is this question of sampling multimodal distributions, but with guarantees. Here, in a sense, sampling a multimodal distribution could just be judged undoable — I mean, there is some NP-hardness that is hidden somewhere in this picture.
So of course, it's not going to be something general, but I'm really wondering — I mean, I'm really thinking — that there should be some assumptions, some way of formalizing the problem, under which we could understand how to construct algorithms that will provably, you know, succeed in making this happen. And so here, I don't know, it's a theoretical question, but I'm very curious about what we will manage to say in this direction.
Yeah. And actually that sets us up, I think, for the last two questions of the show. I mean, I have other questions, but we've already been recording for a long time, so I need to let you go and have dinner — I know it's late for you. So let me ask you the last two questions I ask every guest at the end of the show. First one: if you had unlimited time and resources, which problem would you try to solve?
I think it's an excellent question because it's an excellent opportunity maybe to say that we don't have unlimited resources. I think it's probably the biggest challenge we have right now to understand and to collectively understand because I think now we individually understand that we don't have unlimited resources.
And in a sense, the biggest problem is how we move this complex system of human societies we have created in the direction where we use, precisely, fewer resources. And I mean, it has nothing to do with anything that we have discussed before, but it feels to me that it's really where the biggest question is lying, the one that really matters today. And I have no clue how to approach it. But I think it's actually what matters.
And if I had unlimited time and resources, that's definitely what I would be researching. Yeah, love that answer. And you're definitely in good company — lots of people have talked about that for this question, actually. And second question: if you could have dinner with any great scientific mind — dead, alive, or fictional — who would it be? So, I mean, a logical answer, given my last response, is actually Grothendieck.
So, I don't know, you probably know about this mathematician who, I mean, was somebody worried about, you know, our relationship to the world, let's say, as scientists, very early on, and who had concluded that to some extent we should not be doing research. So, I don't know that I agree, but I also don't think it's obviously wrong. So I think it would probably be one of the most interesting discussions — and add to that that he was a fantastic speaker.
And I do invite you to listen to his lectures — it would be really fascinating to have this conversation. Yeah. Great, great answer. You're definitely the first one to answer Grothendieck. But that'd be cool. Yeah, if you have a favorite lecture of his, feel free to put that in the show notes — I think it's going to be really interesting and fun for listeners. Might be in French, but... I mean, there are a lot of subtitles now.
If it's on YouTube, it does a pretty good job at automated transcription, especially in English, so I think it will be OK. And that will be good for people's French lessons — so yeah, you know, two birds with one stone. So definitely include that. Awesome, Marie-Lou. So that was really great. Thanks a lot for taking the time and being so generous with your time. I'm happy, because I had a lot of questions, but I think we did a pretty good job at tackling most of them.
As usual, I put resources and a link to your website in the show notes for those who want to dig deeper. Thank you again, Marie-Lou, for taking the time and being on this show. Thank you so much for having me.