Let me show you how to be a good Bayesian and change your predictions. Getting Daniel Lee on the show is a real treat. With 20 years of experience in numeric computation, 10 years creating and working with Stan, 5 years working on pharma-related models, you can ask him virtually anything. And that I did, my friends. From joint models for estimating oncology treatment efficacy to PKPD models, from data fusion for US Navy applications to baseball and football analytics,
as well as common misconceptions or challenges in the Bayesian world, our conversation spans a wide range of topics that I am sure you will appreciate. Daniel studied mathematics at MIT and statistics at Cambridge University, and when he's not in front of his computer, he's a savvy basketball player and a hip-hop DJ. You actually have his SoundCloud profile in the show notes if you're curious. This is Learning Bayesian Statistics, episode 96, recorded October 12, 2023. Hello, my dear Bayesians.
Some of you may know that I teach workshops at PyMC Labs to help you jumpstart your Bayesian journey, but sometimes the fully live version isn't a fit for you. So we are launching what we call the Guided Learning Path. This is an extensive library of video courses handpicked from our live workshops that unlocks asynchronous learning for you.
From A/B testing to Gaussian processes, from hierarchical models to causal inference, you can explore it all at your own pace, on your own schedule, with lifetime access. If that sounds like fun and you too want to become a Bayesian modeler, feel free to reach out at alex.andorra at pymc-labs.com. And now, let's get nerdy with Daniel Lee. Daniel Lee, welcome to Learning Bayesian Statistics. Hello. Yeah, thanks a lot for taking the time.
I'm really happy to have you on the show because I've followed your work for quite a long time now and I've always thought that it'd be fun to have you on the show. And today was the opportunity. So thank you so much for taking the time. And so let's dive right in. What are you doing? How would you define the work you're doing nowadays? And what are the topics you are particularly interested in? Yeah, so I just joined Zelus Analytics recently.
They're a company that does sports analytics, mostly for professional teams. Although they're expanding to amateur college teams as well. And what I get to do is... look at data and try to project how well players are going to do in the future. That's the bulk of what I'm focused on right now. That sounds like fun. Were you already a sports fan or is it that mainly you're a good modeler and that was a fun opportunity that presented itself? Yeah, I think both are true.
I grew up playing a lot of basketball. I coached a little bit of basketball. So I feel like I know the subject matter of basketball pretty well. The other sports I know very little about, but, you know, combine that with being able to model data, and it's actually a really cool opportunity. Yeah. And actually, how did you end up doing what you're doing today? Because I know you've had a very, very sinuous path.
So I'm really interested also in your kind of origin story because, well, that's an interesting one. So how did you end up doing what you're doing today? Yeah. So sports ended up happening because, I don't know, it actually started through Stan. I didn't really have an idea that I'd be working in sports full-time professionally until this opportunity presented itself.
And what ended up happening was I met the founders of Zelus Analytics independently about a decade ago, and the company didn't start till 2019. So, you know, met them. Luke was at Harvard, Dan was at NYU, and Doug at the time was going to the Dodgers. And I talked to them independently about different things and, you know, fast forward about 10 years and I happened to be free. This opportunity came up. They're using Stan inside. They're using a bunch of other stuff too, but it was a good time.
And do you remember how you first got introduced to Bayesian methods and also why they stuck with you? Yeah. So there are actually two different times that I got introduced to Bayesian methods. The first was I was working in San Diego. This is after my undergraduate degree. We were working on trying to estimate when hardware would fail and we're talking about modems and things that go with satellite dishes.
These happen to be deployed in places that are hard to reach, and when one of those pieces goes down, it's actually very costly to repair, especially when you don't have a part available. So we started using graphical models, using something called Weka to build graphical models and do Bayesian computation. This was all done using graphical models and it was all discrete. That was the first time I got introduced to Bayesian statistics. It was very simple at the time.
What ended up happening after that was I went to grad school at Cambridge, did Part III mathematics and ended up taking all the stats courses. And that's where I really saw Bayesian statistics, learned MCMC, learned how BUGS was built using graphical models and conjugacy. And then... Yeah, so that was the real introduction to Bayesian modeling.
Yeah. And actually I'm curious because, especially in any content where we talk about how you ended up doing what you're doing and stuff like that, there is kind of a hindsight bias: it looks obvious how you ended up doing what you're doing, and it almost seems easy. But I mean, at least in my case, that wasn't true, you know. It's like you always have obstacles along the way and so on, which is not necessarily negative, right?
We have a really good saying in French that basically says the obstacles in front of you make you grow. It's a very hard thing to translate, but that's the substance of it. So yeah, I'm just curious about your own path. How sinuous was it to get to where you are right now? I've always believed in learning from failures, or learning from experiences where you don't succeed. That's where you gain the most knowledge.
That's where you get to learn where your boundary is. If you want to know about the path to how I became where I'm at now, let's see. I guess I could go all the way back to high school. I grew up just outside of Los Angeles. In high school... I had a wonderful advisor named Sanzha Kazadi. He was a PhD student at Caltech and he ran a research program for high school kids to do basic research. So starting there, I learned to code and was working on the traveling salesman problem.
From there, I went to MIT, talking about failures. I tried to be a physics major going in. I failed physics three times in the first year, so I couldn't be one. I ended up being a math major. And it was math with computer science, so it was really close to a theoretical computer science degree, doing some operations research as well. At the end of MIT, I wasn't doing so well in school. I was trying to apply to grad school, and that wasn't happening. Got a job in San Diego; an MIT alum hired me.
That's where I started, working for three and a half years in software and a little bit of computation. So a lot of it was translating algorithms to production software, working on algorithms. I went through a couple of companies with the same crew; we just kind of bounced around a little bit. At the end of that, I ended up going back to Cambridge for a one-year program called Part III mathematics. It's also a master's degree. I got there not knowing anything about Cambridge.
I didn't do enough research, obviously. For the American listeners, the system is completely different. There are no midterms, no nothing. You have three trimesters. You take classes in each of them and you take two weeks of exams at the end. And that determines your fate. And I got to Cambridge and I couldn't even understand anything in the syllabus other than the stuff in statistics. Mind you, I hadn't done an integral in three years, right? Integrals, derivatives.
I didn't know what the normal distribution was. And I go to Cambridge. Those are the only things I can read. So I'm teaching myself measure theory while learning all these new things that I've never seen, and managed to squeak out passing. So happy. At the end of that, I asked David Spiegelhalter, who happened to have just come back to Cambridge, that was his first year back in the stats department, who I should talk to.
This is, so when I say I learned BUGS, he had a course on applied Bayesian statistics, which was taught in WinBUGS. And he would literally show us which buttons to click and in which order, in order for it not to crash. So that was fun. But he told me I should talk to Andrew Gelman. So I ended up talking to Andrew Gelman and working with Andrew from 2009 to 2016, and that's how I really got into Bayesian stats. After Cambridge, I knew theory. I hadn't seen any data.
Working for Andrew, I saw a bunch of data and actually how to really work with data. Since then, I've run a startup. We tried to take Stan, which is an open-source probabilistic programming language; in 2017, a few of us thought there was a good opportunity for making a business around it, very much like PyMC Labs. And, you know, we tried to make a horizontal platform for it. And at that time, there wasn't enough demand.
So we pivoted and ended up writing very complicated models and estimating things for the pharma industry. And then I left the company at the end of 2021. I consulted for a bit, just random projects, and then ended up at Zelus. So that's how I got to today. Yeah. Man. Yeah, thanks a lot for that exhaustive summary, I'd say, because that really shows how random paths usually are, right?
And I find that really inspiring also for people who are a bit earlier in their career path, who could be looking at you as a role model and could be intimidated, thinking that you had everything figured out from when you were 18 years old, right? Just getting out of high school, which was not the case from what I understand. And that's really reassuring and inspiring, I think, for a lot of people. Yeah, definitely not.
I could tell you, going to career fairs at the end of my undergraduate degree, people would look at my math degree and not even really look at my resume. Because my GPA was low, my grades were bad as a student, and also, who needs a bad mathematician? That makes no sense anywhere. So that limited what I was doing, but in the end it all worked out. Yeah, yeah, yeah. In a way, our paths are similar, except for me, that was a low GPA in business school.
So business school and political science. Political science, I did have decent grades. Business school, it really depended on what the course was about. Because when I was not interested in the course, yeah, that showed. For sure, that showed in the GPA. But yeah, and I find that also super interesting because in your path, there is also so many amazing people you've met along the way and that it seems like these people were also your mentors at some point.
So yeah, do you want to talk a bit more about that? Yeah, I've been really fortunate, you know, as I was going through. I haven't had very many formal mentors that were great, and by that I mean advisors that were assigned to me through schools. They tend to see what I do and discount my abilities because of my inability to do really well at school. So that's what it is. But there were a bunch of people that really did sort of shape my career.
Working for Andrew Gelman was great. He trusted me with a lot, right? So he was able to just set me loose on a couple of problems to start. And he never micromanages, so he just let me go. For some, that's a really difficult place to be, without having guidance on a difficult problem. But for someone like me, that was absolutely fine and encouraging.
You know, working with Andrew, and I worked really closely with Bob Carpenter for a long time, and that was really great because he has such a depth of knowledge and also humility that, I don't know, it's fun working with Bob. Some of the other times that I've really gotten to grow in my career were sitting in on some amazing reading groups. So there are two that come to mind. At Columbia, David Blei runs a reading group for his group and I got to sit in.
And those are phenomenal because they actually go deep into papers and really, really get at the content of the paper, what it's doing, what the research is trying to infer, what's going on, where the research is going next. That really helped expand my horizons for things that I wasn't seeing while working in Andrew's group, because it was just much more machine learning oriented. And in a similar vein, at Cambridge I was able to sit in on Zoubin Ghahramani's group.
Don't know why he let me, but he let me just sit in on his group's reading groups, and he had a lot of good people there at the time. That was when Carl Rasmussen was there working on his book. David Knowles, I don't know who else, but just sitting there reading these papers, reading these techniques, people presenting their own work inside the reading group. Yeah, my encouragement would be, if you have a chance to go sit in on reading groups, go join them.
It's actually a good way, especially if it's not in your area of focus. It's a good way to learn and make connections to literature that otherwise would be very hard to read on your own. Yeah, I mean, completely agree with that. And yeah, it feels like a dream team of mentors you've had. I'm really jealous. Like David Spiegelhalter, Andrew Gelman, Bob Carpenter, all those people. It's absolutely amazing. And I've had the chance of interviewing them on the podcast.
So I will definitely link to those episodes in the show notes. And yeah, completely agree. Today, I would definitely try and do it with Andrew, because I've talked with him quite a lot already. And yeah, it's really inspiring, and that's really awesome. And I completely agree that in general, that's something that I'm trying to do, and that's also why I started the podcast, in a way. Surrounding yourself with people smarter than you is usually a good way to go.
And definitely, me, I've had the chance also to have some really amazing mentors along my way. People like Ravin Kumar, Thomas Wiecki, Osvaldo Martin, Colin Carroll, Austin Rochford. Well, Andrew Gelman also, with everything he's produced. And yeah, Adrian Seyboldt also, absolutely brilliant.
Luciano Paz. All these people, basically, in the PyMC world who helped me when I was really starting and not even knowing about Git, taking a bit of their free time to review my PRs and help me along the way. That's just really incredible. So yeah, what I encourage people to do when they really start in that domain is, much more than trying to find an internship that shines on a CV, trying to really find a community where you'll be surrounded by smart and generous people.
That's usually going to help you much more than a name on the CV. Absolutely. And so actually, I'd like to talk a bit about some of the pharma-related models you've worked on. You've worked on so many topics, it's really hard to interview you. But a kind of model I'm really curious about, also because we work on that at PyMC Labs from time to time, is pharma-related models. And in particular, can you explain how Bayesian methods are used in estimating the efficacy of oncology treatments?
And also, what are PKPD models? Yeah, let's start with PKPD models. So PKPD stands for pharmacokinetic/pharmacodynamic models. The pharmacokinetics part describes what happens when we take a drug and it goes into the body. You can model that: you know how much drug goes into the body, and then at some point it has to exit the body through absorption, through something, right? Your liver can take it out, it'll go into your bloodstream, whatever. That's the kinetics part.
You know that the drug went in and it comes out. So you can measure the blood at different times. You can measure different parts of the body to get an estimate of how much is left. You can estimate how that works. The pharmacodynamic part is the more difficult part. So each person responds differently to the drug depending on what's inside the drug and how much concentration is in the body.
You and I could take the same dose of ibuprofen and we're going to ask each other how you feel and that number is, I don't know, is it on a scale of 1 to 10? You might be saying a 3, I might be saying a 4 just based on what we feel. There are other measurements there that... sometimes you can measure that's more directly tied to the mechanism, but most of the time it's a few hops away from the actual drug entering the bloodstream.
So the whole point of pharmacokinetic/pharmacodynamic modeling is just measuring: drug goes in, drug goes out, what's the effect. It's used in trials and in the design of how much dose to give people. So if you give someone double the dosage, are they actually gonna feel better? Is the level of drug gonna be too high such that there are side effects, so on and so forth. The way Bayesian methods play out here, to take a really broad step back, is this.
The last generation of models assumed that everyone came from the same population: you were trying to take individuals and individual responses and get the mean parameters of a, usually, parameterized model of how the kinetics works and then how the dynamics works.
It'd be better if we had hierarchical models that assume there is a population mean, but each person is an individual, and that can describe the dynamics for each person a little better than just plugging in the overall mean. So to do that, you kind of ended up needing Bayesian models.
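For readers who want to see what that hierarchical idea can look like in code, here is a minimal sketch of a one-compartment, IV-bolus PK model with per-subject elimination rates, written in PyMC (one of the tools mentioned just below). The dosing scheme, the numbers, and the simulated data are hypothetical, purely for illustration, not Daniel's actual models.

```python
import numpy as np
import pymc as pm

# Hypothetical setup: one IV bolus dose per subject, concentrations sampled at a few time points.
rng = np.random.default_rng(1)
n_subjects, dose = 8, 100.0
times = np.tile([0.5, 1.0, 2.0, 4.0, 8.0], (n_subjects, 1))
true_ke = rng.lognormal(np.log(0.3), 0.2, size=(n_subjects, 1))  # per-subject elimination rate (1/h)
conc_obs = dose / 20.0 * np.exp(-true_ke * times) * rng.lognormal(0.0, 0.1, size=times.shape)

with pm.Model() as pk_model:
    # Population-level parameters (what the older, non-hierarchical approach stops at)
    mu_log_ke = pm.Normal("mu_log_ke", mu=np.log(0.3), sigma=0.5)
    sigma_log_ke = pm.HalfNormal("sigma_log_ke", sigma=0.5)
    volume = pm.LogNormal("volume", mu=np.log(20.0), sigma=0.5)

    # Hierarchical part: each subject gets their own elimination rate around the population mean
    log_ke = pm.Normal("log_ke", mu=mu_log_ke, sigma=sigma_log_ke, shape=(n_subjects, 1))

    # One-compartment IV bolus kinetics: C(t) = dose / V * exp(-ke * t)
    conc = dose / volume * pm.math.exp(-pm.math.exp(log_ke) * times)

    # Lognormal measurement noise on the assay
    sigma_obs = pm.HalfNormal("sigma_obs", sigma=0.2)
    pm.LogNormal("conc_obs", mu=pm.math.log(conc), sigma=sigma_obs, observed=conc_obs)

    idata = pm.sample()
```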
But on top of that, the other reason why Bayesian models are really popular for this stuff right now is that the people who study these models have a lot of expertise in how the body works and how the drugs work. And so they've been wanting to incorporate more and more complexity into the models, which is very difficult to do inside the setting of certain packages that limit the flexibility. There's a lot of flexibility that you can put in, but there's always a limit to that flexibility.
And that's where Stan and other tools like PyMC are coming into play now, not just for the Bayesian estimates, but really for the ability to create models that are more complex. And that are generative, in particular? Yes, because for these types of studies, people are really trying to understand what happens. Like, what's the best dosage to give people? Should it be scaled based on the size of the human? It's a lot of what-happens questions.
Can you characterize what's going to happen if you give it to a larger population? You've seen some variability inside the smaller trial; what happens next? Yeah, fascinating. And so it seems to me that it's kind of a really great use case for Bayesian stats, right? Because, I mean, you really need a lot of domain knowledge here. You want that in the model. You probably also have good ideas of priors and so on.
But I'm wondering, what are the main challenges when you work on that kind of model? The main challenges, I think, some of the challenges have to do with, at least when I was working there... So mind you, I didn't work directly for a pharma company; we had a startup where we were building these models and selling to pharma. One of the issues is that there are a lot of historic, very good reasons for using older tools. Things don't move as fast, right?
So you've got regulators, you've got people trying to be very careful and conservative. So trying out new methods on the same data, if it doesn't produce results that they're used to, it's a little harder to do there than it is, let's say in sports, right? In sports, no one's gonna die if I predict something wrong next year. If you use a model that's completely incompatible with the data in pharma and it gives you bad results, bad things do happen sometimes.
So anyway, things move a little slower. The other thing is that most people are not trained in understanding Bayesian stats yet. You know, I do think that there's a difference... in understanding Bayesian statistics from a theoretic, like on paper point of view, and actually being a pragmatic modeler of data. Um, and right now I think there's a turning point, right?
I think the world is catching up and the ability to model is spreading a lot wider. So anyway, I think part of that is happening in pharma as well. Yeah, yeah, for sure. These kinds of models, I really find them fascinating because they are both quite intricate and complicated from a statistical standpoint, so you really learn a lot when you work on them. And at the same time, they are extremely useful and helpful.
And usually, they are extremely fascinating projects that have a deep impact on people, basically directly helping people. I find them absolutely fascinating. I mean, I can tell you that specifically, the place where I had difficulty working on PK/PD models was that I didn't understand the biology enough. So there are these terms, these constants, these rate constants that describe elimination of the drug through the liver.
And because I don't know biology well enough, I don't know what's a reasonable range. And, you know, people that study the biology know this off the top of their head because they've studied the body, but most of them aren't able to work with a system like Stan well enough to write the model down. And it's that mismatch that makes it really tough, because then, you know, there's...
In some of the conversations we had in that world, it's, you know, why aren't you using a Jeffreys prior? Why aren't you using a non-informative prior? But on the flip side, it's like, if that rate constant is 10 million, is that reasonable? No, it's not. It has to be between, like, zero and one.
So for me, if we put priors there that encode that limit, that makes the modeling side of it a lot easier. But as someone who didn't understand the biology well enough to make those claims, it made the modeling much, much more difficult, and harder to explain as well. Yeah, yeah, yeah. Yeah, definitely. And the biology of those models is absolutely fascinating, really, really intriguing.
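As a tiny illustration of that point, here is a hedged sketch contrasting an effectively flat prior with one that encodes the biologists' knowledge that an elimination rate constant lives between 0 and 1; the distributions and numbers are just assumptions for the example.

```python
import pymc as pm

# An "effectively flat" prior: it happily proposes rate constants in the millions.
with pm.Model():
    k_elim_vague = pm.HalfNormal("k_elim", sigma=1e6)
    vague_draws = pm.sample_prior_predictive(500)

# A prior encoding the domain knowledge "this rate constant has to be between zero and one".
with pm.Model():
    k_elim_informed = pm.Beta("k_elim", alpha=2, beta=2)
    informed_draws = pm.sample_prior_predictive(500)

# Comparing the two sets of prior draws makes the conversation with the biologists concrete:
# is a rate constant of 10 million plausible, or should the prior rule it out from the start?
```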
And also, you've also worked on something that's called data fusion for US Navy applications. So that sounds very mysterious. How did Bayesian statistics contribute to these projects? And what were some of the challenges you faced? Unfortunately, I didn't know Bayesian stats at the time. This was when I first started working. But, you know, data fusion's actually... We should have used Bayesian stats. If I was working on a problem now, it should be done with Bayesian stats. The...
Just the problem in a nutshell, if you imagine you have an aircraft carrier, it can't move very fast, and what it has is about a dozen ships around it. All of them have radars. All of them point at the same thing. If you're sitting on the aircraft carrier trying to make decisions about what's coming at you, what to do next. If there's a single plane coming at you, that's one thing.
If all the 12 ships around you, you know, hit that same thing with the radar and it says that there are 12 things coming at you because things are slightly jittered, that's bad news, right? So, you know, if they're not identifying themselves. So the whole problem is, is there enough information there where you can... accurately depict what's happening based on multiple pieces of data. Hmm. Okay. Yeah, that sounds pretty fun. And indeed, yeah, lots of uncertainty.
So I'm guessing you don't have a lot of data, and also, it's the kind of experiment you cannot really redo over and over. So Bayesian stats would be helpful here, I'm guessing. Yeah, it's always the edge cases that are tough, right? If the plane or the ship that's coming at you says who they are, identifies themselves, and follows normal protocol, it's an easy problem: you have the identifier. But it's when that stuff's latent, right?
People hide it intentionally. Then you have to worry about what's going on. The really cool thing there was a guy I worked for, Clay Stannick, had come up with a way to take each of the radar pictures and just stack them on top of each other. If you do that, if you see a high intensity, then it means that the pictures overlap. And if there's no high intensity, then it means the pictures don't overlap. And the nice thing is that that's rotation invariant.
So it really just helps with the alignment problem, because everyone's looking at the same picture from different angles. Yeah, it's super interesting. I love that. And you haven't had the opportunity to work again on that kind of model now that you're a Bayesian expert? No. Well, you've heard it, folks. If you have some models like that that sound entertaining, feel free to contact him, or contact me and I will contact him for you if you want. So actually,
I'm curious, you know, in general, because you've worked with so many people and in so many different fields, I wonder if you've picked up some common misconceptions or challenges that people face when they try to apply Bayesian stats to real-world problems, and how you think we can overcome them. Yeah, that's an interesting question. I think, working with Stan, well, yeah, I think the common error is that we don't build our models complex enough.
They don't describe the phenomenon well enough to really explain the data, and I think that's the most common problem that we have. Yeah, the thing that I get the most mileage out of is actually putting in a measurement model, or just adding a little more complexity to the model, and it starts working way better. In pharmacometrics specifically, I remember we started asking, how do you collect the data? In what ways is the measurement wrong?
And we just modeled that piece and put it into the same parametric forms of the model and everything started fitting correctly. It's like, cool, I should do that more often. So yeah, I think if I was to think about that, that's sort of the thing. The other thing is, I guess people try to apply Bayesian stats, Bayesian models to everything, and it's not always applicable. I don't know if you're actually going to be able to fit a true LLM using MCMC. Like I think that'd be very, very difficult.
So it's okay to not be Bayesian for that stuff. Yeah, that's interesting. So nothing about priors, or about model fitting, or about sampling time of the models? No, I mean, they're all related, right? The worse the model fits, so when a model doesn't actually match the data, at least running in Stan, it tends to overinflate the amount of time it takes, and the diagnostics look bad. A lot of things get fixed once you start putting in the right level of complexity to match the data.
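Here is a small, hedged sketch of the "add a measurement model" move Daniel describes a little earlier: instead of modeling the readings directly, the latent quantity of interest is observed through an explicitly modeled, biased and noisy assay. All names and numbers are made up for the example.

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
true_value = 5.0
# Hypothetical assay: readings are biased low and noisy, because of how the data is collected
readings = true_value * 0.9 + rng.normal(0.0, 0.5, size=20)

with pm.Model() as measurement_model:
    latent = pm.HalfNormal("latent", sigma=10)      # the quantity we actually care about
    bias = pm.Normal("bias", mu=0.9, sigma=0.05)    # how the measurement is wrong, modeled explicitly
    noise = pm.HalfNormal("noise", sigma=1)
    pm.Normal("readings", mu=latent * bias, sigma=noise, observed=readings)
    idata = pm.sample()
```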
But, you know, yeah. I mean, MCMC is definitely slower than running optimization, that's true. Yeah. No, for sure. Yeah, I'm asking because, as I'm teaching a lot, these are recurring themes. I mean, it really depends where people are coming from, but you have recurring themes that can be kind of a difficulty for people. Something I've seen that's pretty common is understanding the different types of distributions.
So prior predictive samples and prior samples, how do they differ? Posterior samples, posterior predictive samples, what's the difference between all of that? That's definitely a topic that can trigger some difficulty for people. And I mean, I think that's quite normal. I remember personally, it took me a few months to really understand that stuff when I started learning Bayesian stats.
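For anyone hitting the same confusion, here is a minimal sketch in PyMC, with a made-up coin-flip-style model, showing where each of those four kinds of draws comes from.

```python
import numpy as np
import pymc as pm

data = np.array([1, 0, 1, 1, 0, 1, 1, 1])  # hypothetical binary outcomes

with pm.Model() as model:
    p = pm.Beta("p", alpha=1, beta=1)
    y = pm.Bernoulli("y", p=p, observed=data)

    # Prior draws of p, and prior predictive draws of y (simulated before seeing the data)
    prior = pm.sample_prior_predictive(1000)

    # Posterior draws of p, conditioned on the observed data
    idata = pm.sample()

    # Posterior predictive draws: new y simulated from the posterior draws of p
    post_pred = pm.sample_posterior_predictive(idata)
```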
And now with my educational content, I try to decrease that time for people, so that maybe they make the same mistakes as me, but they realize it faster than I did. That's kind of the objective. Yeah, that's really good. So what other things do you see that people are struggling with? What are some of the common themes right now? I mean, priors, a lot.
Priors are extremely common, especially if people come from the classic machine learning framework, where it's really hard for them to choose a prior. And actually, something I've noticed is two ways of thinking about priors that allow people to be less anxious about choosing one. The first is making them realize that having flat priors doesn't mean not having priors.
And so the fact that they were using flat priors before by default, in a classical regression for instance, that's a prior. That's already an assumption. So why would you be less comfortable making another assumption, especially if it's more warranted in that case?
The second is basically trying to see this idea of priors along a slider, you know, a gradient, where the extreme left would be completely flat priors, which lead to a completely overfit model that has a lot of variance in its predictions. And at the other end of the slider, the extreme right would be the completely biased model, where your priors would basically be either a point mass or completely outside the realm of the data, and then you cannot update, basically.
But that would be a completely underfit model. So in a way, the priors are here to allow you to navigate that slider. And why would you always want to be at the extreme left of the slider, right? Because in the end, you're already making a choice. So why not think a bit more exhaustively and explicitly about the choices you're making? Yeah, that usually helps make them feel less guilty about choosing priors. So that's interesting. Yeah, absolutely.
And to go on that point a little bit, that's what I'm trying to say with the complexity of the model. A lot of times we just assume things are normal, but sometimes things aren't normal; there's more variance than a normal allows. So making something a t-distribution sometimes fixes it. Just understanding the prior predictive, the posterior, the posterior predictive draws, also summarizing those and looking at the data, really helps.
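A tiny sketch of that swap, with simulated heavy-tailed data: the only change from the usual normal model is the Student-t likelihood. Everything here is hypothetical and just for illustration.

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(42)
y = 10 + 2 * rng.standard_t(df=3, size=200)  # data with heavier tails than a normal

with pm.Model() as robust_model:
    mu = pm.Normal("mu", mu=0, sigma=10)
    sigma = pm.HalfNormal("sigma", sigma=5)
    nu = pm.Gamma("nu", alpha=2, beta=0.1)   # degrees of freedom of the t
    # Swapping pm.Normal for pm.StudentT here is the "things aren't normal" fix
    pm.StudentT("y", nu=nu, mu=mu, sigma=sigma, observed=y)
    idata = pm.sample()
```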
One thing that I think for anyone trying to do models in production, one thing to know is that models, the programs that you write, either in PyMC or Stan, the quality of the fit is not just the program itself, it's the program plus the data. If you swap out the data and it has different properties than the one that you trained it on before, it might actually have worse properties or better properties.
And we can see this with, like, non-centered parameterizations and different variance components being estimated in weird ways. If you just blindly assume that you can take your model that fit on one dataset and blindly productionize it, it doesn't quite work that way yet, unfortunately. Yeah, yeah, yeah. For sure.
And also, another prompt that I use to help them understand a bit more, basically, why we're using... generative models and why that means making assumptions and how to make them and being more comfortable making assumptions is, well, imagine that you had to bet on every decision that your model is making. Wouldn't you want to use all the information you have at your fingertips, especially with the internet now?
It's not that hard to find some information about the parameters of any model you're working on and come up with a somewhat informed prior. You don't need the perfect prior, there is no best prior, because it's a prior: you have the data, so it's going to be updated anyway, and if you have a lot of data it's going to be washed out. But if you had to bet on any decision you're making, or that your model is making, wouldn't you want to use all the information you have available, instead of just throwing your hands in the air and being like, oh no, I don't know anything, so I'm going to use flat priors everywhere? You really don't know anything? Have you searched on Google? It's not that far. So yeah, that usually also helps when you frame it in the context of decision-making with an incentive, which here would be money. If you were betting your money, or your life, then, well, it would make sense, right, to use any bit of information that you can put your hands on. So why wouldn't you do it here?
Actually, I'm curious: with your extensive experience in the modeling world, do you have any advice you would give to someone looking to start a career in computational Bayesian stats, or data science in general? Yeah, my advice would probably be to try to go deeper in one subject, or not even one subject, go deeper in one dimension, than you're comfortable going.
If you want to get into like actually building out tools, go deep, understand how PyMC works, understand how Stan works, try to actually submit pull requests and figure out how things are done. If you want to get into modeling, go start understanding what the data is. Go deep. Don't just stop at, you know, I have data in a database. Go ask how it's collected. Figure out what the chain actually is to get the data to where it is. Going deep in that way, I think, is going to get you pretty far.
It'll give you a better understanding of how certain things are. You never know when that knowledge actually comes into play and will help you. But a lot of the... Yeah, that would be my advice. Just go deeper than maybe your peers or maybe people ask you to. Yeah, that's a really good point.
Yeah, I love it. And it's true that, you know, when I think of the people around me, usually it's that kind of person, the ones who stick to it with that passion, who are in the place they wanted to be, because, well, they also have that passion to start with. That's really important. I remember someone recently asked me, should they focus on machine learning or Bayesian stats, is Bayesian stats going to go away, is AI taking over?
And my answer to that, I think was pretty much along the lines of go and learn any of them really well. If you don't learn any of them really well, then you'll just be following different things and be bouncing back and forth and you'll miss everything. But if you... end up like Bayesian stats has been around for a while and I don't think it's going to go away.
But if you bounce from Bayesian stats, try to go to ML, try to go to deep learning without actually really investing enough time into any of those, when it comes down to having a career in this stuff, you're going to find yourself like a little short of expertise to distinguish yourself from other people. So that, you know, that's... That's where this advice mentality is coming from. Especially just starting out.
I mean, there are so many things to look at right now that it's hard to keep track of everything. Yeah, no, for sure. That's definitely a good point, too. And actually, in your opinion, currently, what are the main sticking points in the Bayesian workflow that you think we can improve? All of us in the community of probabilistic programming languages, core developers, Stan, PyMC, and all those PPLs, what do you think are the sticking points that
would benefit from some love from all of us? Oh, that's a good question. You know, in terms of the workflow, I think usability can just get better; we can do a lot more there. With that said, it's hard. The tools that we're talking about are pretty niche, so it's not like there are millions and millions of users of our techniques. It's just hard to do that.
But, you know, the thing that I run into a lot is transformations of problems, and I really wish that we'd end up with reparameterizations of problems done automatically, such that the problem fits well with the method that you choose. If we could do that, then life would be good, but, you know, I think that's a hard problem to tackle. Yeah, I mean, for sure.
Because that's also something I've started to look into, and hopefully in the coming weeks I'll be able to look into it for PyMC. Precisely, I was talking about that with Ricardo Vieira, where we were thinking of, you know, having user-facing wrapper classes on some distributions, you know,
the normal and so on, with the classic reparameterizations, where instead of making the users reparameterize by hand themselves, you could just ask PyMC for, say, a non-centered pm.Normal, and it would do that for you. That'd be really cool. Of course, these are always bigger PRs than you suspect when you start working on them, but that definitely would be a fun project I'd like to work on in the coming weeks.
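Since that wrapper is still hypothetical, here is what the manual version it would automate looks like today: a centered hierarchical normal next to its non-centered reparameterization, which is mathematically the same model but often much friendlier to the sampler.

```python
import pymc as pm

J = 8  # hypothetical number of groups

# Centered parameterization: group effects are drawn directly around the population mean
with pm.Model() as centered:
    mu = pm.Normal("mu", mu=0, sigma=5)
    tau = pm.HalfNormal("tau", sigma=5)
    theta = pm.Normal("theta", mu=mu, sigma=tau, shape=J)

# Non-centered parameterization: sample standardized offsets, then shift and scale them
with pm.Model() as non_centered:
    mu = pm.Normal("mu", mu=0, sigma=5)
    tau = pm.HalfNormal("tau", sigma=5)
    z = pm.Normal("z", mu=0, sigma=1, shape=J)
    theta = pm.Deterministic("theta", mu + tau * z)
```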
But we'll see how that goes with open source. That's always very dependent on how much work you have to do before to actually pay your rent and then see how much time you can afford to dedicate to open source, but hopefully I'll be able to make that happen and that'd be definitely super fun. And actually talking about the future developments, I'm always curious about Stan.
What do you folks have on your roadmap, especially some exciting developments that you've seen in the works for the future of Stan? So I actually haven't, I don't know what's coming up on the roadmap too much. Lately, I've been focused on working on my new job and so that's good. But a couple of the interesting things are Pathfinder just made it in. It's a new VI algorithm, which I believe addresses some of the difficulties with ADVI. So that should be interesting.
And finally, tuples should land, if they haven't already landed, inside the Stan language. That means a function can have multiple returns, which should be better for efficiency and for writing things down in the language. Other than that, you know, there's always activity around new functionality in Stan and making things faster. And the work on the interfaces that make it a lot easier to operate Stan is always good.
So there's CmdStanR and CmdStanPy that really do a lot of the heavy lifting. Yeah. Yeah, super fun. For sure, I didn't know Pathfinder was there, but definitely super cool. Have you used it yourself? And is there any kind of model you'd recommend using it on? No, I haven't used it myself. But there is a model that I'm working on at Zelus that I do want to use it on. So we're doing what we call component skill projection models.
So you have observations of how players are doing for many measurements, and then you have that over years, and you can imagine that there are things that you don't observe about them, a function that you apply to the underlying latent skill that then produces the output. And you're trying to estimate over time what that does. And so for something like that, I think using an approximate solution would probably be really good.
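Purely as an illustration of the structure Daniel sketches verbally, and not Zelus's actual model, a latent skill that drifts from season to season and drives several noisy observed metrics could look like this in PyMC; it can be fit with MCMC or, in the spirit of the approximate solutions he mentions, with ADVI via pm.fit. All names, shapes, and data are assumptions.

```python
import numpy as np
import pymc as pm
import pytensor.tensor as pt

rng = np.random.default_rng(7)
n_seasons, n_metrics = 6, 3
obs = rng.normal(size=(n_seasons, n_metrics))  # placeholder per-season measurements for one player

with pm.Model() as skill_model:
    # Latent skill follows a random walk across seasons
    drift = pm.HalfNormal("drift", sigma=0.5)
    steps = pm.Normal("steps", mu=0, sigma=drift, shape=n_seasons)
    skill = pm.Deterministic("skill", pt.cumsum(steps))

    # Each observed metric is a noisy (here, linear) function of the latent skill
    loading = pm.Normal("loading", mu=0, sigma=1, shape=n_metrics)
    intercept = pm.Normal("intercept", mu=0, sigma=1, shape=n_metrics)
    noise = pm.HalfNormal("noise", sigma=1, shape=n_metrics)
    pm.Normal("obs", mu=intercept + loading * skill[:, None], sigma=noise, observed=obs)

    # Approximate fit (ADVI); pm.sample() would be the full-MCMC alternative
    approx = pm.fit(n=20_000, method="advi")
    idata = approx.sample(1_000)
```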
Yeah. Do you already have a tutorial page on the Stan website that we can refer people to for this episode's show notes? I'm not sure. I could send it to you, though. I believe there's a Pathfinder paper out on arXiv; Bob Carpenter's on it. OK, yeah, for sure. Yeah, add that to the show notes, and I'll make sure to put that on the website when your episode goes out, because I'm sure people are going to be curious about that.
Yeah. And more generally, are there any emerging trends or developments in Bayesian stats that you find particularly exciting or promising for future applications? No, but I do feel like with the adoption of Bayesian methods and modeling, there's still room for that to spread, especially in the world now where LLMs are the biggest rage and are being applied everywhere. I still think that there's space for more places to use really smart, complex models with limited data.
So with all these tools, I just think that, you know, more industries need to catch on and start using them. Yeah, I see. Already, I'm pretty impressed by what you folks do at Zelus. That sounds really fun and interesting. And actually, in one of the most recent episodes I did, episode 91, with Max Göbel, we were talking about European football analytics. And I'm really surprised. So I don't know if you folks at Zelus already work on the European market, but I'm really impressed.
I'm pretty impressed by how mature the US market is on that front of sports analytics, and, on the contrary, how continental Europe, at least, is really, really far behind that curve. I am both impressed and appalled. I'm curious what you know about that. I don't think anyone's that far behind right now. So I know you had Jim Albert on the show too, and I heard both of those, right? And the thing that I'm really excited about right now is making all the models more complex, right?
So I think that, you know, we probably have some of the more advanced models, or at least up to the industry standard in a lot of areas, and more complex than others. You know, I just got to the company, and when I look at it, I think there's another order of complexity that we can get to using the tools that already exist. And that's where I'm excited. The data is out there. It's been collected for, you know, five years, ten years. There's new tracking data
that's happening. So there's more data coming out, more fidelity of data, but even using the data that we have, a lot of the models that people are fitting are at the summary-statistics level. And that's great and all; we're making really good things that people can use with that level of information. But we can be more granular, write more complex models, and have a better understanding of the phenomenon, like how these metrics are being generated.
And I think that's, for me, what's exciting right now. Yeah. And that's what I've seen too, mainly in Europe, where now you have amazing tracking data. Really, really good. In football I don't know that much, because unfortunately I haven't had the inside peek that I've had for rugby. And I mean, that tracking data is absolutely fantastic. It's just that people don't build models on it. They just do descriptive statistics, which is already good, but they could do so much more with that.
But for now, I haven't been successful explaining to them what they would get with models. And something I'm guessing is that there is probably not enough competitive pressure on this kind of usage of data. Because, I mean, unless they are very special, a sports team is never going to come to you as a data scientist and tell you, hey, we need models, because they don't really know what the difference between a mean and a model actually is.
So usually these kinds of data analytics are sold by companies here in Europe. And from a company standpoint, they don't have a lot of competitive pressure. Why would you invest in writing models, which are hard to develop and take time and money, whereas you can just, you know, sell raw data that you then do descriptive stats on? That costs way less, and still you're ahead of the competition with that. Kind of makes sense. So yeah, I don't know.
I'm curious what you've seen, and I think the competitive pressure is way higher in the US, which also explains why you are trying to squeeze even more information from your data with more complex models. Yeah. I think you've described sort of the path of a lot of data analytics going into a lot of industries, which is: the first thing that lands is, there exists data, let's go collect data.
Let's go summarize data, and then someone will take that and sell it to the people that collected the data. And that's cool. And I always think the next iteration of that is taking that data, doing something useful, and deriving insight. The thing that baseball has done really well was linking runs to the outcome they cared about, winning games. Right? You increase your runs, you win games. You decrease your runs, you lose games. It's pretty simple.
So this is where, you know, even I'm having trouble right now too. For basketball, if you shoot a slightly higher percentage, you're gonna score a little more, but does that actually increase your wins? Yeah. And that's really tough to do in the context of five on five. If you're talking about rugby, you've got, is it nine on nine, or is it 11? It's 15. 15, right? Classic European rugby is 15, yeah. Like the World Cup that's happening right now.
So if you've got 15 players, what's the impact of replacing one player? It starts getting a lot harder to measure. So I do think that, even from where I'm sitting, it seems like there's a lot of hype around collecting data and just visualizing data and understanding what's there. And people hope that a cool result will come out just by looking at the data, which I do hope will happen. But as soon as the low-hanging fruit is picked, the next thing has to be models.
And yeah. Yeah, exactly. Completely agree with that. And I think for now it's still a bit too early for Europe. It's going to come, but you can already have really good success just doing descriptive stats, because a lot of people are just not doing it, and so they're recruiting and training just based on gut instinct, which is not useless but can definitely be improved.
You know, one of the other things about sport that's really difficult is that, when we talk about models, we assume everything is normally distributed. We assume that the central limit theorem holds, or the law of large numbers, and that all these things average out. When you talk about the highest level of sport, you're talking about the tail end of the tail end of the tail end. And that is not normal. And that's something somebody needs to model. This is where, like I said, I'm really excited.
It's not everywhere, but a lot of times we do make normality assumptions, and I don't think things are normal there. And I think if we actually model that properly, we're going to see some better results. But it's early days for me. So. Yeah, it's actually a good point.
Yeah. I hadn't thought of that, but yeah, it definitely makes sense because then you get to scenarios which are really the extreme by definition, because even the people you have in your sample are extremely talented people already. So you cannot model that team the same way as you would model the football team from around the corner. Awesome, Daniel. Well, it's already been a long time, so I don't want to take too much of your time.
But before asking you the last two questions, I'm wondering if you have a personal anecdote or example to share of a challenging problem you encountered in your research or teaching related to Bayesian stats, and how you were able to navigate through it. Oh, in teaching? I don't know, that one's a tough one. Yeah... Okay, here's one. One of the toughest ones was just kind of knowing when to give up.
So, going back to a workshop I taught maybe in like 2013, 2012, around Stan. I remember someone had walked in with a laptop that was like a 20-pound laptop, that was like 10 years old at that point and was, I think, running 32-bit Windows, and asking for help on how to run Stan on this thing. That's when you have to give up. Sometimes you just need better tools. It's a good point. Yeah, for sure. That's very true.
That's also something, actually, a message that I want to give to all the people using PyMC. Please install PyMC with Mamba and not pip, because Mamba does things really well, especially with the C compiler, and that will just make your life way easier. I know we repeat that all the time. It's in the readme. It's in the readme of the workshops we teach at PyMC Labs, and yet people still install with pip. So if you really have to install with pip, then do it.
Otherwise, just use Mambaforge. It's amazing. You're not going to have any problems and it's going to make your life easier. There is a reason why all the PyMC core developers ask you that as a first question anytime you tell them, I have a problem with my PyMC install: did you use Mamba? So yeah, that was just a public service announcement; you made me think about it, Daniel, thanks a lot.
Okay, before letting you go, I'm gonna ask you the last two questions I ask every guest at the end of the show. First one, if you had unlimited time and resources, which problem would you try to solve? I would try to solve the income disparity in the US and what that gets you; I'm thinking mostly of health insurance. I think it's really bad here in the US. You just need resources to have health insurance, and it should be basic. It's a basic necessity.
So working on some way to fix that would be awesome, with unlimited time and energy. Yeah, I mean, definitely a great answer. It's the first time we get that one, but I totally agree; especially from a European perspective, it's always something that looks really weird when you're coming to the US. It's super complicated. Also, yeah, one of the things about working in pharma was realizing that a lot of the R&D budget is coming from, you can call it, overpayment by the American system.
And so if you still want new drugs that are better, it's got to come from somewhere, but I'm not sure where. It's a tough problem. Yeah, yeah, yeah. I know, for sure. And second question, if you could have dinner with any great scientific mind, dead, alive, or fictional, who would it be? That one, I thought about this for a while. And you know, the normal cast of characters came up: Andrew Gelman, Bob Carpenter, Matt Hoffman. But the guy that I would actually sit down with is Shawn Frayne.
You probably haven't heard of him. He's an American inventor. He has a company called Looking Glass Factory that does 3D holographic displays without the need for a headset. He happens to have been my college roommate and my big brother in my fraternity, Nu Delta, at MIT. And I haven't caught up with him in a long time. So that's the guy I would go sit down with. That sounds like a very fun dinner. Well, thanks a lot, Daniel. This was really, really cool.
I'm happy because I had so many questions for you and so many different topics, but we managed to get that in. So yeah, thank you so much. As usual, I put resources and a link to your website in the show notes for those who want to dig deeper. Thanks again, Daniel, for taking the time to be on this show. Let me show you how to be a good Bayesian, change your predictions...