#239: Bayesian foundations

00:00

In this episode, we'll dive deep into one of the foundations of modern data science, Bayesian algorithms and Bayesian thinking. Join me along with guest Max Sklar as we look at the algorithmic side of data science. This is Talk Python To Me, episode 239, recorded November 10th, 2019. Welcome to Talk Python To Me, a weekly podcast on Python, the language, the libraries, the ecosystem, and the personalities. This is your host, Michael Kennedy. Follow me on Twitter where I'm @mkennedy.

00:39

Keep up with the show and listen to past episodes at talkpython.fm and follow the show on Twitter via @talkpython. This episode is brought to you by Linode and Tidelift. Please check out what they're offering during their segments. It really helps support the show. Max, welcome to Talk Python To Me. Thanks for having me, Michael. It's very great to be on. It's great to have you on as well. You've been on Python Bytes before, but never Talk Python To Me.

01:02

That was a lot of fun. I actually got someone reached out to me on Twitter the other day saying, hey, I saw you on Python Bytes. So that was really exciting. Right on, right on. That's super cool. I heard you on Python Bytes. I always say saw you when it's really heard you, but anyway. It's all good. So now they can say they saw you on Talk Python To Me as well. Now, we're going to talk about some of the foundational ideas behind data science,

01:26

machine learning. That's going to be a lot of fun. But before we get to them, let's set the stage and give people a sense of where you're coming from. How did you get into programming and Python? That is a really interesting question because I think I started in Python a very long time ago, like 10 years ago maybe. I was working on kind of a side project called stickymap.com. The website's still up. It barely works. But it was basically a, it was, it was like my senior project as an

01:53

undergrad. So I really, I started this in 2005. And what it was, was it was, you know, Google Maps had just come out with their API where you can like, you know, include a Google map on your site. And so I was like, okay, this is cool. What can I do with this? Let's add markers all over the map and it could be user generated. We would call them emojis now. And people could leave little messages and little locations and things like that. This was before there was Foursquare, which is where I

02:19

worked, which is location intelligence. This was just me messing around, trying to make something cool and being inspired by the whole host of like, you know, social media startups that were happening at the time. And I was using, what was I using at the time? I was using PHP and MySQL to put that together. I knew nothing about web development. So I went to the Barnes and Noble. I got that book, PHP, MySQL. I got it. But then sometime around like 2008, 2009, I realized, you know, a lot of people

02:48

were talking about Python at work. And I realized like, sometimes I need, this is kind of when I was winding down on the project, but I realized, you know, I had all this data and I realized I needed a way to like clean the data. I needed a way to like write good scripts that would clear up certain like if I have a flat file of like, here's the latitude and longitude, they're separated by tabs. And here's a, you know, here's some text that someone wrote that needs to be cleaned up,

03:16

et cetera, et cetera. Yeah, I can write some scripts in like Python or Java, believe it or not, which I knew at the time, but then, or sorry, a PHP or Python, which I knew at the time, but like, wait, wait, not job. Sorry. I was trying to do it in PHP and Java, which is really bad idea. Yeah. Especially PHP sounds tricky. Yeah. Yes, yes, yes. And then I was like, well, I'm just learning this Python. I need something. So let me try to do it with Python. And it worked

03:44

really well. And then I had, you know, to deal a lot more with CSVs and stuff like that tab separated files. And it really was just a way to like save time at work. And it was like a trick to say, hey, that thing that you're doing manually, I can do that in like 10 minutes. And it's not 10 minutes, maybe a couple hours and write a script. And it's going to take you like one week. Like I saw someone at work trying to change something manually. And so this is all a very long time ago. So I don't

04:12

remember exactly what it was, but it was kind of like a good trick to save time. And it had nothing to do with data science or machine learning at the time. It was more like writing scripts to clean up files. Well, that's perfect for Python, right? Like it's really one of the things that it's super good at. It's so easy to read CSV files, JSON files, XML, whatever, right? It's just they're all like a handful of lines of code and you know, magic happens.
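
A minimal sketch of the kind of cleanup script described here, using Python's built-in `csv` module on tab-separated rows. The sample data, the column layout (latitude, longitude, note), and the `clean_rows` helper are assumptions made up for illustration, not the original Stickymap script.

```python
import csv
import io

# Hypothetical tab-separated export: latitude, longitude, free-text note.
RAW = (
    '40.7411\t-73.9897\t"  great   coffee here "\n'
    '40.7580\t-73.9855\t"said ""hi"" to a friend"\n'
)


def clean_rows(text):
    """Parse tab-separated rows and tidy the free-text column."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    cleaned = []
    for lat, lon, note in reader:
        # The csv module has already removed the surrounding quotes and
        # un-doubled any embedded ones; here we only normalize whitespace.
        note = " ".join(note.split())
        cleaned.append((float(lat), float(lon), note))
    return cleaned


rows = clean_rows(RAW)
for row in rows:
    print(row)
```

Reading from a real file would only change the first step: pass an open file object to `csv.reader` instead of the in-memory `io.StringIO` buffer.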

04:34

Yeah. The one thing that I was really impressed with was like, how easy at the time now, when I wanted to do more complicated Python packages in like 2012, 2013, I realized, oh, actually, some of these packages are complicated to install. But like, I was so impressed with how easy it was to just import the CSV package and just be like, okay, now we understand your CSV. If you have some

04:57

stuff in quotes, no problem. If you want to clean up the quotes, no problem. Like it was all just like, it just happened very fast. Yeah. You don't have to statically link to some library or add a reference to some other thing or none of that, right? It's all good. It's all right there. Yeah. I mean, that was, those were the days when like, I was still programming in C++ for work. So you could imagine what, how big of a jump that was. I mean, that seems so ancient. I used to have to

05:21

program in C++ for the Palm Pilot. That was my first job out of school, which is crazy. Oh, wow. That sounds interesting. Yeah. Yeah. Yeah. Coming from C++, I think people have two different reactions. One, like, wow, this is so easy. I can't believe I did this in so few lines. Or this is cheating. It's not real programming. It's not for me, you know? But I think people go, who even disagree, like, oh, this is not for me, eventually like find their way over. They're pulled in.

05:48

I never had a phase where it was like, oh, this is not for me. But I did have a phase where it was like, I don't see, this is just another language. And I don't see why it's better or worse than any other. I think that's the phase that you go through when you learn any new language where it's like, okay, I see all the features. I don't see what this brings me. It was only through doing those specific projects where it was like, aha, no one could have convinced me.

06:10

Yeah. Also, you know, if you come from another language, right, if you come from C++, you come from Java, whatever, you know how to solve problems super well in that language. And you're comfortable. And when you sit down to work, you say, file a new project and file, new files, start typing. And it's like, okay, well, what do I want to do? I want to call this website or talk to this database. I'm going to create this and I'll do this. And bam, like, you can just do it. You don't have to just

06:35

pound on every little step. Like, how do I run the code? How do I use another library? What libraries are there? Is there like, there's every, you know, it's just that transition is always tricky. And it takes a while before you, you get over that and you feel like, okay, I really actually do like it over here. I'm going to put the effort into learn it properly because I don't care how amazing it is. You're still going to feel incompetent at first.

07:03

The switching costs are so tough. And that's why they say, oh, if you're going to build a new product, it has to be like 10 X better than the one that exists or something like that. I don't know if that's, you know, literally true, but like it's true with languages too, because it's really hard to like pick up a new language and everyone's busy at work and busy doing all the tasks they need to do

07:21

every day. For me, frankly, it was helpful to take that time off in quotes, time off. When I was going to grad school, time off from working full-time as a software engineer to actually pick some of this stuff up. Absolutely. All right. So you had mentioned earlier that you do stuff at Foursquare and it sounds like your early programming experience with sticky maps is not that different than Foursquare,

07:44

honestly. Tell people about what you do. Maybe, I'm pretty sure everyone knows what Foursquare is, what you guys do, but tell them what you do there. People might not be aware of where Foursquare is today. You know, there is Foursquare is kind of known as that quirky check-in app, find good places

08:01

to go with your friends and eat app, you know, share where you are. And that's where we were in 2011, where, when I joined up to, you know, a few years ago, but ultimately, you know, the company kind of pivoted business models and sort of said, Hey, we have this really cool technology that we built for the consumer apps, which is called Pilgrim, which essentially takes the data from your phone and

08:25

translates that into stops. You know, you'd stopped at Starbucks in the morning, and then you stopped at this other place, and then you stopped at work, et cetera, et cetera. And then, you know, that goes into, that finds use cases like, you know, across the apposphere, I don't even know what to call it,

08:41

but many apps would like that technology. And so we have this panel and, you know, so for a few years, I was working on a product at Foursquare called Attribution, where companies, our clients would say, Hey, we want to know if our ads are working, our ads across the internet, not just on Foursquare. And we would say, well, we could tell you whether your ads are actually causing people to go into your

09:03

stores more than they otherwise would. And I worked on that for a few years, which is a really cool problem to solve, a really cool data science problem to solve, because it's a causality problem. It's not just, you know, you can't just say, well, the people who saw the ads visited 10% more, because maybe you targeted people who would have visited 10% more. Exactly. I'm targeting my demographic, so they better visit more. I got it wrong.
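
A small sketch of why the raw comparison can mislead. All counts below are invented for illustration: the ad is mostly shown to frequent shoppers, and within each segment the ad has no effect at all, yet the pooled numbers make it look as if ad viewers visit about twice as often.

```python
# Hypothetical counts, invented for illustration: (people, store visitors)
# split by customer segment and by whether they saw the ad.
COUNTS = {
    "frequent": {"saw_ad": (800, 160), "no_ad": (200, 40)},
    "occasional": {"saw_ad": (200, 10), "no_ad": (800, 40)},
}


def visit_rate(people, visitors):
    return visitors / people


def pooled_rates(counts):
    """Visit rate per exposure group, ignoring segments (the naive view)."""
    totals = {}
    for segment in counts.values():
        for group, (people, visitors) in segment.items():
            seen_people, seen_visitors = totals.get(group, (0, 0))
            totals[group] = (seen_people + people, seen_visitors + visitors)
    return {group: visit_rate(*total) for group, total in totals.items()}


def segment_rates(counts):
    """Visit rate per exposure group within each segment."""
    return {
        name: {group: visit_rate(*cell) for group, cell in segment.items()}
        for name, segment in counts.items()
    }


pooled = pooled_rates(COUNTS)
by_segment = segment_rates(COUNTS)
print(pooled)      # ad viewers appear to visit far more often overall
print(by_segment)  # within each segment the two rates are identical
```

Comparing within segments is only one simple way to account for targeting; it is not a description of how Foursquare's Attribution product works.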

09:26

That industry is a struggle, because the people that you're selling to often don't have the backgrounds to understand the difference, and sometimes don't have the incentives to understand the difference. But we did the best we could. And so that led to kind of an acquisition that Foursquare did earlier this year of Placed, which was an attribution company owned by Snap, but they sold it to us through this big deal. You can read about it online. Giant tech company trade.

09:56

Yeah. And so I had left Foursquare in the interim, but then I recently went back to work with the founder, Dennis Crowley, and just kind of building new apps and trying to build cool apps based on location technology, which is really why I got into Foursquare, why I got into Stickymap, and I'm just having so much fun. So that's, and we have some products coming along the way where it's not enterprise. It's not, you know, measuring ads. It's not ad retargeting. It's just

10:26

building cool stuff for people. And I don't know how long this will last, but I couldn't be happier. Sounds really fun. I'm sure Squarespace is, sorry, Squarespace. You're not the first fan. Squarespace is around here. Foursquare is in New York where you are. Now, I'm sure that that's a great place to be, and they're doing a lot of stuff. They used something like Scala. There's some functional programming language that primarily there, right? Is it Scala?

10:53

Yeah, it's primarily Scala. I've actually done a lot of data science and machine learning in Scala. And sometimes I'm kind of envious of Python because there's better tools in Python. And we do some of our, we do some of our initial testing on data sets in Python sometimes, but there is a lot of momentum to go with Scala because all of our backend jobs are written in Scala. And so we often have to translate it into Scala, which has good tools, but not as good as Python.

11:19

Yeah. Yeah. So I was going to ask, what's the Python story there? Do you guys get to do much Python there? Yeah. So I have done, if I can take you back in the, to the olden days of 2014, if that's, if that's allowed, because one of the things that I did at Foursquare that I'm pretty proud of is building a sentiment model, which is trying to take a Foursquare tip, which were like three sentences that people wrote in Foursquare on the Foursquare City Guide app. And that gets surfaced

11:50

later. It was sort of compared to the Yelp reviews, but except they're short and helpful and not as negative. What we want to do is we want to take those tips and try to come up with the rating of the venue because we have this one to 10 rating that every venue receives. And so using the likes and dislikes explicitly wasn't good enough because there were so many people who would just click like

12:12

very casually. And so we realized at some point, Hey, we have a labeled training set here. We can say, Hey, the person who explicitly liked a place and also left a text tip, that is a label of positive. And someone who explicitly disliked a place, that's a label of negative. And someone who left the middle option, which we called a meh or a mixed review, their tip is probably mixed. And so we have this tremendous data set on tips and that allowed us to build a model, a pretty good model. And it

12:41

wasn't very sophisticated. It was multi-logistic regression based on sparse data, which was like what phrases are included in the tip. Right. Trying to understand the sentiment of the actual words, right? Yeah. There was logistic regression available in Python at the time, which is great, but I wanted something a little custom, which is now available in Python. But back then it was kind of hard to find these packages and not just that there, even when there were packages, sometimes

13:08

it's difficult to say, okay, is this working? How do I test what's going on under the hood? It's not very, so I decided to build my own in Python, which was a multi-logistic regression, meaning we're trying to find out three categories like positive review, negative review, or mixed review based on the labeled data. And we were going to have a sparse data set, which means it's not like there are 20 words that we're

13:34

looking for. No, there are like tens of thousands. I don't know the exact number, tens of thousands, hundreds of thousands of phrases that we're going to look for. And for most of the tips, most of the phrases are going to be zero. Didn't see it, didn't see it, didn't see it. But every once in a while, you're going to have a one, did see it. So that's when you have that matrix where most of them are zero, that's sparse. And then thirdly, we wanted to use elastic net, which meant that

13:56

most of the weights are going to be set to exactly zero. So when we store our model, most words, it's going to say, hey, these words aren't sentiment. So we're just going to, these don't really affect it. We want to have it exactly zero, except what a traditional logistic regression would do is it would say, okay, we are going to come up with the optimal, but everything will be close to zero. And so you have to kind of store it. You have to store the

14:21

like 0.0001. So that's a problem too. So I actually built that kind of open source and put that on my GitHub as BayesPy back in 2014. I don't think anyone uses it, but it was a lot of fun. I used Cython to make it go really fast. It's kind of a problem at Foursquare because it's the only thing that runs in Python. And every once in a while, someone asks me like, what's this doing here? Exactly. How do I run this? I don't know. This doesn't fit in our world, right? Yeah.
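
A rough sketch of the three ideas described above: three-class logistic regression, sparse features, and weights that are exactly zero. The weights, the single-word "phrases", and the function names are hypothetical and chosen for illustration; this is not the code from the GitHub project mentioned. The `soft_threshold` function shows only the L1 part of an elastic net penalty, which is the part that pushes small weights to exactly zero.

```python
import math

# Hypothetical sparse model: phrase -> weights for (negative, mixed, positive).
# Any phrase missing from the dict has weight exactly zero and is never stored.
WEIGHTS = {
    "amazing": (-1.0, -0.2, 1.5),
    "rude": (1.4, 0.1, -1.2),
    "okay": (-0.1, 0.8, -0.1),
}
CLASSES = ("negative", "mixed", "positive")


def predict(tip):
    """Softmax over class scores, summing only the phrases present in the tip."""
    scores = [0.0, 0.0, 0.0]
    for phrase in set(tip.lower().split()):
        for k, weight in enumerate(WEIGHTS.get(phrase, (0.0, 0.0, 0.0))):
            scores[k] += weight
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return {label: e / total for label, e in zip(CLASSES, exps)}


def soft_threshold(weight, penalty):
    """L1 shrinkage step: any weight within the penalty of zero becomes 0.0."""
    if weight > penalty:
        return weight - penalty
    if weight < -penalty:
        return weight + penalty
    return 0.0


probs = predict("The staff was amazing")
print(max(probs, key=probs.get))
print(soft_threshold(0.0001, 0.01), soft_threshold(1.5, 0.01))
```

With this shrinkage step, a weight like 0.0001 is stored as nothing at all, while a strongly informative weight is only nudged slightly toward zero.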

14:45

Cool. All right. Well, Foursquare sounds really fun. Another thing that you do that I know you from, I don't know you through the Foursquare work that you're doing. I know you through your podcast, The Local Maximum, which is pretty cool. You had me on back on episode 73. So thanks for that. That was cool. That is our most downloaded episode right now. Really? Wow. Awesome. Yeah. That's super cool to hear.

15:07

Yeah. More relevant for today's conversation, though, would be episode 78, which is all about Bayesian thinking and Bayesian analysis and those types of things. So people can check that out for a more high level, less technical, more philosophical view, I think, on what we're going to talk about if they want to go deeper, right?

15:27

Absolutely. You could also ask me questions directly because I ramble a little bit in that, but I cover some pretty cool ideas, some pretty deep ideas there that I've been thinking about for many years. Yeah, for sure. So maybe tell people just really quickly what The Local Maximum is, just to give you a chance to tell them about it. Yeah. So I started this podcast about a year and a half ago in 2018.

15:48

And it started with, you know, I started basically interviewing my friends at Foursquare being like, hey, this person's working on something cool, that person's working on something cool, but they never get to tell their story. So why not let these engineers tell their story about what they're working on? And since then, I've kind of expanded it to cover, you know, current events and interesting

16:09

topics in math and machine learning that people can kind of apply to their everyday life. Some episodes get more technical, but I kind of want to bring it back to the more general audience that it's like, hey, my guests and I, we have this expertise. We don't just want to talk amongst ourselves. We want to actually engage with the current events, engage with the tech news and try to think, okay, how do we

16:29

apply these ideas? And so that's sort of the direction that I've been going in. And it's been a lot of fun. I've expanded beyond tech several times. I've had a few historians on, I've had a few journalists on. That's cool. I like the intersection of tech and those things as well. Yeah, it's pretty nice. This portion of Talk Python To Me is brought to you by Linode. Are you looking for hosting that's fast, simple, and incredibly affordable? Well, look past that bookstore and check out Linode at

16:57

talkpython.fm/Linode. That's L-I-N-O-D-E. Plans start at just $5 a month for a dedicated server with a gig of RAM. They have 10 data centers across the globe. So no matter where you are or where your users are, there's a data center for you. Whether you want to run a Python web app, host a private Git server, or just a file server, you'll get native SSDs on all the machines, a newly upgraded 200 gigabit network, 24-7 friendly support, even on holidays, and a seven-day money-back

17:26

guarantee. Need a little help with your infrastructure? They even offer professional services to help you with architecture, migrations, and more. Do you want a dedicated server for free for the next four months? Just visit talkpython.fm/Linode. Let's talk about general data science before we get into the Bayesian stuff. So I think one of the misconceptions in general is that you have to be a mathematician or be very good at math to be a programmer. I think that's a false statement.

17:59

To be a programmer. Yes, yes. Software developer. Straight up, I built the checkout page on this e-commerce site, for example. I would agree. I think you need some abstract thinking. You can't escape letters and stuff and variables, but you don't need, well, in the case of data science to compare, like you don't need, you don't need algebra or you don't need maybe a little bit, but you don't really need calculus and

18:23

you don't need geometry, linear algebra and geometry. Yeah. Sometimes it's a UI engineer. You might need a little geometry. I mean, there's certain parts that you need that kind of stuff. Like video game development, for example, everything is about multiplying something by a matrix, right? You put all your stuff on the screen, even arrange it and rotate it by multiplying by matrices. There's some stuff happening there you

18:44

got to know about, but generally speaking, you don't. However, I feel like in data science, you do get a little bit closer to statistics and you do need to maybe understand some of these algorithms. And I think that's where we can focus our conversation for this show is like, what do we need to know in general? And then the idea of Bayes' theorem and things like that.

19:06

What do we need to know if I wanted to go into say data science? Because like I said, I don't really think you need to know that to do like, you know, connecting to a database and like saving a user. And you absolutely need logical thinking, but not like stats, but for data science, what do you think you need to know? Well, for data science, it really depends on what you're doing and how far down the rabbit hole you

19:29

really want to go. You don't necessarily need all of the philosophical background that I talk about. I just love thinking about it. And it sort of helps me focus my thoughts when I do work on it to kind of go back and think about the first principles. So I get a lot of value out of that, but maybe not everyone does. There is sort of a surface level data science that or machine learning

19:53

that you can get away with. If you want to do simple things, which is like, hey, I want to understand the idea that I have a training set, you know what a training set is, and this is what I want to predict. And here is roughly my mathematical function of how I know whether I'm predicting it well or not, but it could be something simple like the square distance, but already you're introducing some math there. And basically, I'm going to take a look at some libraries and I'm going to

20:23

see if something works out of the box and gives me what I need. And if you do it that way, you need a little bit of understanding, but you don't need everything that like I would say kind of a true data science or machine learning engineer needs. But if you want to go deeper and kind of make it your profession, I would say you need kind of a background in calculus and linear algebra. And again, like, look, if I went back to grad school and I like if I went to a linear algebra

20:52

final and I took it right now, would I be able to get every question right? Probably not. But I know the basics and I have a great understanding of how it works. And if I look at the equations, I can kind of break it down, you know, maybe with a little help from Google and all that. I think there's a danger of using these libraries to make predictions and other stuff when you're like, well, the data goes in here to this function and then I call it and then out comes the answer.

21:18

Maybe there's some conditionality versus independence requirement that you didn't understand and it's not met or, you know, whatever, right? That's why I said it's really surface level and you can get away with it sometimes, but

21:30

only for so long. And I think understanding where these things go wrong outside the, you know, when you take these black box functions requires both kind of a theoretical understanding of how they work and then also just like experience of seeing things going wrong in the past. Yeah. That experience sounds hard to get, but it seems like it's just experience, right? You just, you got to get out there and do it.

21:52

Right. Well, here's a good example. One time I was trying to predict how likely someone is to visit a store. This was part of working on Foursquare's attribution product, right? And someone was using random forest algorithm, or maybe it was just a simple decision tree. I'm not sure, but basically it creates a tree structure and puts people into buckets and determines whether or not, you know, and for each bucket, it says, okay, in this bucket, everyone visited and in this bucket,

22:21

everyone didn't, or maybe this bucket is 90, 10 and this bucket is 10, 90. And so I can give good predictions on the probability someone will visit based on where they fall on the leaves of the tree. And we were using it and something just wasn't making sense to me. Somehow the numbers were just, something was wrong. And then I said, okay, let's make, let's make more leaves. And then I made more

22:44

leaves. Like I made, I made the tree deeper, right? And then they're like, see, when you make the tree deeper, it gets better. That makes sense because it's, it's more fine graining. I'm like, yeah, but something doesn't make sense. It shouldn't be getting this good. And then as I realized what was happening, what was it, what was happening was some of the leaves had nobody visited in this leaf. That makes a lot of sense because most days you don't visit any particular chain.

23:08

And when it went to zero and then it saw someone visited, well, the log likelihood loss, it basically predicted 0% for an event that did happen. And so log, when you do log likelihood loss or negative log likelihood loss, the score is like the negative log of that. So essentially you should be penalized infinitely for that because there was no smoothing. But the language we were using, which I think was Spark or something like that. And it was probably some library in Spark. I probably

23:40

shouldn't throw Spark under the bus. It was probably some library or something was changing that infinity to a zero. So the thing that was infinitely bad, it was saying was infinitely good. And so the worst thing. And that took, oh God, that took us so long to figure out. Like it's embarrassing how long that one took to figure out, but that's, that's a good example of when experience will get you something. I don't think I've ever talked about this one publicly.
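
The failure described here can be reproduced in a few lines. This is a sketch under assumptions: the leaf counts are invented, `buggy_score` merely imitates the infinity-to-zero behavior described above rather than any specific library, and additive (Laplace) smoothing is one common fix, not necessarily the one used at the time.

```python
import math


def log_loss(predicted, outcomes):
    """Average negative log-likelihood of binary outcomes (1 = visited)."""
    total = 0.0
    for p, y in zip(predicted, outcomes):
        prob_of_what_happened = p if y == 1 else 1.0 - p
        if prob_of_what_happened == 0.0:
            # The model called this outcome impossible, and it happened.
            return math.inf
        total += -math.log(prob_of_what_happened)
    return total / len(outcomes)


def leaf_rate(visits, total, alpha=0.0):
    """Visit rate for a tree leaf; alpha > 0 applies additive smoothing."""
    return (visits + alpha) / (total + 2 * alpha)


def buggy_score(loss):
    """Imitates the bug: an infinite loss is silently reported as zero."""
    return 0.0 if math.isinf(loss) else loss


# Hypothetical leaf: 0 visits among 50 training users.
raw = leaf_rate(0, 50)
smoothed = leaf_rate(0, 50, alpha=1.0)

# A held-out user who lands in that leaf does visit.
raw_loss = log_loss([raw], [1])
smoothed_loss = log_loss([smoothed], [1])
print(raw_loss, buggy_score(raw_loss), smoothed_loss)
```

Because lower loss is better, reporting the infinite penalty as zero makes the worst possible prediction look perfect, which is why deeper trees with more empty leaves appeared to score better and better.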

24:05

Yeah. Well, you just got to know that, you know, that's not what we're expecting, right? Yeah. But you know, theoretically, Hey, if I more fine grained my tree, if I, you know, make my groups smaller, maybe it works better. But I was like something, I was like, something's not right. It's working a little too good. There was nothing specifically that got me, but it was just like, there's probably a lot of stuff out there. That's actually people are taking actions on and

24:30

spending money on, but it's, it's like that, right? Yeah. Yeah. So let's see. So we talked about some of the math stuff. If you really want to understand the algorithms, you know, statistics, calculus, linear algebra, you obviously need calculus to understand like real statistics, right? Continuous statistics and stuff. What else though? Like, do you need to know machine learning? What kind of algorithms do you need to know? Like what, what in the computer science-y side of things

24:55

do you think you got to know? The bread and butter of the data scientists that I work with is machine learning algorithms. So I think that is very helpful to know. And I think that, you know, some of the basic algorithms in machine learning are good to know, which is like k-nearest neighbors, k-means, logistic regression, decision trees, and then some kind of random forest algorithm, whether it's just random forest, which is a mixture of trees or gradient boosted trees we've had a lot

25:22

of luck with. And then a lot of this deep learning stuff is, well, neural networks is one of them. Maybe you don't need to be an expert in neural networks, but it's certainly one to be aware of. And based on these neural networks, deep learning is becoming very popular. And I've been hearing and kind of looking into reading about deep learning for many years, but I have to say, I haven't actually

25:45

implemented one of these algorithms myself. But I just interviewed a guy on my show, Mark Ryan, and he came out with a book called machine learning for structured data, which means, hey, you don't just, this doesn't just work for like images or audio recognition, you could actually use it for regular marketing data, like use everything else for. So I was like, all right, that's interesting. Maybe

26:06

I'll work on that now. But I don't think at this point, you need to know machine learning to be a good or deep learning to be a good data scientist or machine learning engineer. I think the basics are really good to know, because in many problems, you know, the basics will get you very far. And there's a lot less that can go wrong. Yeah, a lot of those algorithms you talked about as well, like K-Nearest Neighbor and so on.

26:26

There are several books that seem to cover all of those. I can't think of any off the top of my head, but I feel like I've looked through a couple and they all seem to have like, here are the main algorithms you need to know to kind of learn data science. So not too hard to pick them up. The author's name is Bishop, the book that I read for grad school, but that's already 10 years old, certainly had all that stuff. That was very deep on math. I can send you a link if you want.

26:46

Sure. I think kind of any intro book to machine learning will have all of that stuff. And basically, it's not in order of like hard to easy. It's just sort of, hey, these are things that have helped in the past and that statisticians and machine learning engineers have relied on in the past to get started and it's worked for them. So maybe it'll work for you. Cool. Well, a lot of machine learning and data science is about making predictions. We have some

27:12

data. What does that tell us about the future, right? Right. That's where the Bayesian inference comes from in that world, right? Yeah. It's trying to form beliefs, which could be a belief about something that already happened that you don't know about, but you'll find out in the future or be affected by in the future, or it could be a belief about something that will happen in the future. So something that either will happen in

27:35

the future or you'll learn about in the future. But Bayesian inference is more about, you know, forming beliefs and I kind of call it like it's a quantification of the scientific method. So in the basic form, the Bayes rule is very easy. You start with your current beliefs and you codify that in a

27:53

special mathematical way. And then you say, okay, here's some new data I received on this topic. And then it gives you a framework to update your beliefs within the same framework that you began with. Right. And so like an example that you gave would be say a fire alarm, right? We know from like life experience that most fire alarms are false alarms. You know, one example is what is your prior belief that there is a fire right now without seeing the alarm? The alarm is the data.
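
In symbols, the update described here is usually written as Bayes' rule. The notation is an assumption added for illustration: $H$ stands for a hypothesis (for example, "there is a fire"), $D$ for the data (the alarm), and $H_i$ ranges over all competing hypotheses.

```latex
P(H \mid D) = \frac{P(D \mid H)\, P(H)}{P(D)},
\qquad
P(D) = \sum_{i} P(D \mid H_i)\, P(H_i)
```

Here $P(H)$ is the prior belief, $P(D \mid H)$ is the likelihood of the data under that hypothesis, and $P(H \mid D)$ is the updated belief, which can serve as the prior the next time new data arrives.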

28:23

The prior is what's the probability that, you know, my building is on fire and I need to get the F out right now. You know, it's very low actually. Yeah. I mean, yeah, for most of us, it hasn't really happened in our life. Maybe we've seen one or two fires, but they weren't that big of a deal. I'm sure there are some people in the audience who have seen bad fires and for them, maybe their prior is a little higher. I once in my entire life have had to escape a fire. Yeah. Only once, right?

28:51

Were you in like real danger or? Oh yeah, probably. It was a car and the car actually caught on fire. Oh yeah. That sounds pretty bad. It had been worked on by some mechanics and they put it back together wrong. It like shot oil over something and it caught fire. And so we're like, Oh, the car's on fire. We should get out of it. Yeah. But yeah, sitting in your building at work, your prior is going to be much lower than in a car that

29:11

you just worked on. So when the alarm goes off, okay, that's your data. The data is that we received an alarm today. And so then you have to think about, okay, I still have two hypotheses, right? Hypothesis one is that there is a fire and I have to escape. And hypothesis two is that there is no fire. And so once you hear the alarm, you still have those two hypotheses. One is that the alarm is going off and there's a real fire. And two is that there is no fire, but this is a false alarm.

29:42

And so what ends up happening is that because there's a significant probability of a false alarm. So at the beginning, there is a very low probability of a fire. After you hear the alarm, there's still a pretty low probability of a fire, but the probability of a false alarm still overwhelms that. Now I'm not saying that you should ignore fire alarms all the time, but because in that case,

30:03

that's a, that's a case where the action that you take is important regardless of the belief. So, you know, Hey, there is a very low cost to checking into it, at least checking into it or leaving the building in, if you have a fire alarm, but there's a very high consequence of failure. So high. Exactly. Exactly. But in terms of just forming beliefs, which is a good reason not to panic, you shouldn't put a lot of probability on the idea that there's definitely a fire.
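
To put rough numbers on the fire alarm example, here is a minimal sketch of the two-hypothesis update in Python. The three probabilities are illustrative assumptions rather than measured values, and the function name `posterior_fire` is made up for this example.

```python
# Illustrative numbers only; none of these are measured values.
prior_fire = 1 / 100_000          # assumed chance of a fire in the building today
p_alarm_given_fire = 0.99         # assumed: the alarm almost always sounds in a real fire
p_alarm_given_no_fire = 1 / 1000  # assumed chance of a false alarm today


def posterior_fire(prior, like_fire, like_no_fire):
    """Return P(fire | alarm) for the two-hypothesis case via Bayes' rule."""
    # Prior times likelihood gives the unnormalized posterior for each hypothesis.
    unnorm_fire = prior * like_fire
    unnorm_no_fire = (1 - prior) * like_no_fire
    # Dividing by the sum normalizes the two values so they add up to one.
    return unnorm_fire / (unnorm_fire + unnorm_no_fire)


p = posterior_fire(prior_fire, p_alarm_given_fire, p_alarm_given_no_fire)
print(f"P(fire | alarm) = {p:.4f}")
```

Under these assumed numbers the belief in a fire rises a lot relative to the prior but stays small, which matches the point that the false alarm hypothesis still dominates. What to do about the alarm is a separate question that depends on the costs, not only on this probability.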

30:31

Okay. Yeah. So that's basically Bayesian inference, right? I know how likely a fire is. I have all of a sudden, I have this piece of data that now there is a fire. I have a set, a space of hypotheses that could apply, try to figure out which hypothesis, start testing and figure out which one is the right

30:50

one. Maybe. Yeah. So you take your prior. So let's say there's like a, I don't know, one in a hundred thousand chance that there's a fire in the building today and a 99,999 in a hundred thousand chance there isn't. Then you take that, that's your prior. Then you multiply it by your likelihood, which is okay. What is the likelihood of seeing the data given that the hypothesis is true? So what's the likelihood that the alarm would go off if there is a fire? Maybe that's pretty high. Maybe that's close to one

31:21

or a little bit lower than one. And then on the second hypothesis that there's no fire, what's the likelihood of a false alarm today, which could actually be pretty high. Could be like one in a thousand or even one in a hundred in some buildings. And then you multiply those together and then you get an unnormalized posterior and that is your answer. So it's really just multiplication. Yeah. It's like simple fractions once you have all the pieces, right? So it's a pretty simple

31:45

algorithm. It's very hard to describe through audio, but it's much better visually if you want to check it out. I've been struggling to describe it through audio for, you know, for the last year and a half, but I do the best I can. This is like describing code. You can only take it so precisely. Yeah. This portion of Talk Python To Me is brought to you by Tidelift. Tidelift is the first managed open source subscription, giving you commercial support and maintenance for the open source

32:12

dependencies you use to build your applications. And with Tidelift, you not only get more dependable software, but you pay the maintainers of the exact packages you're using, which means your software will keep getting better. The Tidelift subscription covers millions of open source projects across Python, JavaScript, Java, PHP, Ruby, .NET, and more. And the subscription includes security updates, licensing verification and indemnification, maintenance and code improvements, package selection

32:38

and version guidance, roadmap input, and tooling and cloud integration. The bottom line is you get the capabilities you'd expect and require from commercial software. But now for all the key open source software you depend upon. Just visit talkpython.fm/Tidelift to get started today. This comes from a reverend, Reverend Bayes, who came up with this idea in the 1700s, but for a long time,

33:07

it wasn't really respected, right? And then it actually found some pretty powerful, it solved some pretty powerful problems that matter a lot to people recently. Yeah. I mean, I can't go through the whole, do the whole history justice in just a few minutes, but I'll try to give my highlights, which was this reverend who was sort of, he was a, you know,

33:28

he was into theology and he was also into mathematics. So he was probably like pondering big questions and he wrote down notes and he was trying to figure out the validity of various arguments. His notes were found after he died, so he'd never published that. And so this was taken by

33:46

Pierre Laplace, who was a more well-known mathematician and kind of formalized it. But when the basis of statistical thinking was built in the late 19th, early 20th century, it really went in a more frequentist direction where it's like, no, a probability is actually a fraction of a repeatable experiment that kind of like over time, what fraction does it, does it end up

34:15

as? And so they consider probability as sort of a, an objective property of the system. So for example, a die roll, well, each side is one sixth. That's like kind of an objective property of the, of the die. Whereas, no, Bayesian statistics is sort of based on belief. And because belief kind of seemed unscientific and the frequentists had very good methods for coming up with, with answers and

34:40

more, more objective ways of doing it, they sort of had the upper hand. But as kind of the focus got into more complex issues and we had the rise of computers and that sort of thing, and the rise of more data and that sort of thing, Bayesian inference started taking a bigger and bigger role until now, I think most

35:02

machine learning engineers and most data scientists think as Bayesians. And so it's like some examples in history, most people are probably aware of Alan Turing at Bletchley Park, along with many other people, you know, building these machines that broke the German codes during World War II. There's a whole movie about it. Right. That's trying to break the Enigma machine and the Enigma code. And that, those were some important problems to solve, but also highly challenging.

35:31

Yeah. And so they incorporated a form of Bayes rule into this. Well, what are my relative beliefs as to the setting of the machine? Because, you know, the machine could have had quadrillions of settings and they're trying to distinguish between which one is likely to have and which one's not likely to have. But after the war, that stuff was classified. So nobody could say, oh yeah, Bayesian inference was

35:55

used in that problem. And one interesting application that I found, even as it wasn't accepted by academia for many years, was life insurance. Because they're kind of on the hook for determining if the actuaries get the answer wrong as to how likely people are to live and die, then they're on the hook for lots and lots of money or like the continuation of their company if they get it wrong. And so- Right. Right. Or how likely is it to flood here?

36:20

How likely is it for there to be a hurricane that wipes this part of the world off the map? Right. And a lot of these were one-off problems. You know, one problem is, you know, what's the likelihood of two commercial planes flying into each other? It hadn't happened, but they wanted to estimate the probability of that. And you can't do repeated experiments on that. So they really had to

36:38

use priors, which were sort of like expert data. And then, you know, more recently, as we had the rise of kind of machine learning algorithms and big data, you know, Bayesian methods have become more and more relevant. But also a big problem was, you know, the problems that we just mentioned, which are, you know, fire alarms and figuring out whether or not you have a disease and things like that. That's the

37:02

two hypothesis problem. But a lot of times you have an infinite space, you have an infinite hypothesis problem that you're trying to determine between an infinite set of possible hypotheses. And that becomes very difficult to do, becomes extremely difficult without a computer, even with a computer becomes difficult to do. And so, you know, there's been a lot of research into how do you search that space

37:24

of hypotheses to find the ones that are most likely. And so if you've heard the term Markov chain Monte Carlo, that is the most common algorithm used for that purpose. And there is even current research into making that faster and finding the hypothesis you want more quickly. Andrew Gelman at

37:41

Columbia has some, a lot of stuff out about this. And he has like a new thing that's called NUTS, which is the No-U-Turn Sampler, which is based off a very complicated version of MCMC. And so that's what's used in a framework that Python has called PyMC3 to come up with your most likely hypothesis very, very quickly. So let's take this over to the Python world. Yeah. Like, yeah, there's a lot of stuff that works

38:10

with it. And obviously, like you said, the machine learning deep down uses some of these techniques, but this PyMC3 library is pretty interesting. Let's talk about it. So its subtitle is probabilistic programming in Python. If I could start with some alternatives, which I've used because I haven't, I've been diving into

38:30

reading about PyMC3, but I haven't used it personally. So even when I was doing things in 2014, just on my own, basically without libraries, I was able to use Python very, very easily to kind of put in these equations for Bayesian inference on whether it's multi-logistic regression, or another one I did was Dirichlet prior calculator, which if I can kind of describe that, it's sort of thinking, well, how, what should I believe about a place before I've seen any reviews? Should I

39:00

believe it's good? Should I believe it's bad? You know, if I have very few reviews, what should I believe about it? Which was an important question to ask for something like Foursquare City Guide in many cases, because we didn't have a lot of data. And so that was a good application of Bayesian inference. And I was able to just use the equations straight up and kind of from first principles,

39:21

apply algorithms directly in Python. And it actually was not that hard to do because when searching the space, there was a single global maximum, didn't have to worry about the local maximum in these equations. So it was just a hill climbing. Hey, I'm going to start with this hypothesis in this n dimensional space, and I'm going to find the gradient, I'm going to go a little higher, a little higher, a little higher gradient ascent is what I described, although it's usually called

39:47

gradient descent. So that's sort of an easy one to understand. Then if you want to do MCMC directly, because you have some space that you want to search, and you have the equations of the probability on each of the points in that space, I used emcee, which is spelled E-M-C-E-E, which is a simple program that only does MCMC. And so I had a lot of success with that when I wanted to do some one-off sampling of, you know, non-standard probability distributions. So those are ones that I've actually

40:24

used and had success with in the past. But PyMC3 seems to be like the full, you know, we do everything sort of a thing. And basically, what you do is you program probabilistically. So you say, hey, I imagine that this is how the data is generated. So I'm just going to basically put that in code. And then I'm going to let the algorithm work backwards and tell me what the

40:50

parameters originally were. So if I could do a specific here, let's say I'm doing logistic regression, which is like, every item has a score, or, you know, in the case that I was working on, every word has a score, the scores get added up, that's then a real number, then it's transformed using a sigmoid into a number between zero and one. And that's the probability that's a positive review.
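
The forward direction of that model is short enough to write out. This is a minimal sketch in plain Python; the word scores and the names `word_scores` and `prob_positive` are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical per-word scores; in a real system these are the learned parameters.
word_scores = {"delicious": 2.0, "friendly": 1.0, "slow": -1.5, "dirty": -2.5}


def sigmoid(x):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def prob_positive(words, scores, bias=0.0):
    """Add up the scores of the words in a review, then apply the sigmoid."""
    total = bias + sum(scores.get(w, 0.0) for w in words)
    return sigmoid(total)


print(prob_positive(["delicious", "friendly"], word_scores))  # high probability
print(prob_positive(["slow", "dirty"], word_scores))          # low probability
```

Words that are missing from the table contribute a score of zero here, which is one possible design choice rather than anything stated in the conversation.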

41:13

And so basically, you'll just say, hey, I have this vector that describes the words this has, then I'm going to add these parameters, which I'm not going to tell you what they are. And then I'm going to get this result. And then I'm going to give you the final data set at the end. And it kind of works backwards and tells you, okay, this is what I think the parameters were.
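
As a rough illustration of "working backwards" from data to parameters, the sketch below recovers word scores with plain gradient ascent, the hill climbing approach mentioned earlier. It is not PyMC3 and it returns a single best set of weights rather than a distribution. The tiny dataset, the Gaussian prior width `prior_sd`, the step size, and the function names are all assumptions made for the example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_word_scores(reviews, labels, vocab, steps=2000, lr=0.1, prior_sd=2.0):
    """Gradient ascent on the log posterior of a logistic regression.

    Each weight gets a zero-mean Gaussian prior with standard deviation
    prior_sd, so the objective is concave and has a single maximum.
    """
    weights = {w: 0.0 for w in vocab}
    for _ in range(steps):
        # Gradient of the log prior: -w / prior_sd**2 for each weight.
        grad = {w: -weights[w] / prior_sd**2 for w in vocab}
        for words, y in zip(reviews, labels):
            p = sigmoid(sum(weights[w] for w in words))
            for w in words:
                grad[w] += y - p  # gradient of the Bernoulli log-likelihood
        # Take a small step uphill.
        for w in vocab:
            weights[w] += lr * grad[w]
    return weights


# Made-up reviews as bags of words, with 1 = positive and 0 = negative.
reviews = [["delicious", "friendly"], ["delicious"], ["slow", "dirty"],
           ["slow"], ["friendly", "slow"]]
labels = [1, 1, 0, 0, 1]
vocab = sorted({w for r in reviews for w in r})

weights = fit_word_scores(reviews, labels, vocab)
for word in vocab:
    print(f"{word:10s} {weights[word]:+.2f}")
```

The prior keeps the weights finite even when a word only ever appears in positive reviews, which would otherwise push its score toward infinity.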

41:33

And what's really interesting about something like PyMC3, which I would like to use in the future is when you do a linear regression or logistic regression, in kind of standard practice, you get one model at the end, right? This is the model that we think is best. And this is the model that has the highest probability. And this is the model that we're going to use. Great. You know, that that works for a lot of cases. But what PyMC3 does is that instead of picking a model at the end,

42:02

it says, well, we still don't know exactly which model produced this data. But because we have the data set, we have a better idea of which models are now more likely and less likely. So we now have a probability distribution over models. And we're going to let you pull from that. So it kind of gives you a better sense of what the uncertainty is over the model. So for example, if you have a word in your data set, let's say the word's delicious, and

42:28

we know it's a positive word. But let's say for some reason, there's not a lot of data on it, then it can say, well, I don't really know what the weight of delicious should be. It's being used at rock concerts. We don't know why. What does it mean? Yeah, yeah, yeah. And so we're going to give you a few possible models. And, you know, and you can

42:45

keep sampling from that. And you'll see that the deviation, the discrepancy, the variance of that model is going to be very high of that weight is going to be very high, because we just don't have a lot of data on it. And that's something that standard regressions just don't do. That's pretty cool. And the way you work with it is, you basically code out the model and like a really nice Python language API. You kind of say, well, this, I think it's a linear model,

43:13

I think it's this type of thing. And then like you said, it'll go back and solve it for you. That's pretty awesome. I think it's nice. Right. A good thing to think about it is in terms of just a standard linear regression, like, what's the easiest example I can think of? Try to find someone's weight from their height, for example. And so you think there might be an optimal coefficient on there given the data.

43:36

But if you use PyMC3, it will say, no, we don't know exactly what the coefficient is given your data. You don't have a lot of data, but we're going to give you several possibilities. We're going to give you a probability distribution over it. And as I say on The Local Maximum, you shouldn't make everything

43:50

probabilistic because there is a cost in that. But oftentimes you can, by considering something to be, rather than considering one single truth by considering multiple truths probabilistically, you can unlock a lot of value. In this case, you can kind of determine your variance a little better. Yeah, that's super cool. I hadn't really thought about it. And like I said, the API is super clean for doing this. So it's great. Yeah.
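
The height and weight example can be sketched without any library by using a grid approximation in place of the sampling that PyMC3 does. The data points, the no-intercept model, the flat prior, and the known noise level are all simplifying assumptions for illustration, and `slope_posterior` is a made-up name.

```python
import math


def slope_posterior(heights_cm, weights_kg, noise_sd=10.0):
    """Grid approximation to the posterior over the slope b in
    weight = b * height + noise, assuming a flat prior on the grid
    and a known noise standard deviation."""
    grid = [0.2 + 0.001 * i for i in range(401)]  # candidate slopes from 0.2 to 0.6
    log_post = []
    for b in grid:
        sq_err = sum((w - b * h) ** 2 for h, w in zip(heights_cm, weights_kg))
        log_post.append(-sq_err / (2 * noise_sd**2))  # Gaussian log-likelihood
    peak = max(log_post)
    unnorm = [math.exp(lp - peak) for lp in log_post]  # subtract peak for stability
    total = sum(unnorm)
    probs = [u / total for u in unnorm]
    mean = sum(b * p for b, p in zip(grid, probs))
    sd = math.sqrt(sum((b - mean) ** 2 * p for b, p in zip(grid, probs)))
    return mean, sd


# Made-up observations: three people, then the same pattern repeated ten times
# to stand in for a larger data set.
few_h, few_w = [160, 170, 180], [62, 71, 75]
many_h, many_w = few_h * 10, few_w * 10

mean_few, sd_few = slope_posterior(few_h, few_w)
mean_many, sd_many = slope_posterior(many_h, many_w)
print(f"3 points:  slope {mean_few:.3f} +/- {sd_few:.3f}")
print(f"30 points: slope {mean_many:.3f} +/- {sd_many:.3f}")
```

Both runs center on roughly the same coefficient, but the spread is wider with three points than with thirty. That spread is the extra information a single best-fit coefficient does not report.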

44:12

Where does this Bayesian inference, like, where do you see this solving problems today? Where do you see like stuff going? What's the world look like now? I've been using it to solve problems basically as soon as I started working as a machine learning engineer at Foursquare, basically using Bayes' rule as kind of my first principles whenever I approach a problem. And it's never driven me in the wrong direction. So I think it's one of those timeless things that

44:38

you can always use. For me, especially after working with our attribution product a lot, I think that the future is trying to figure out causality a lot better. And I think that's where some of these more sophisticated ideas come in. Because it's one thing to say, this variable is correlated with that and I can have a model. But it's like, well, what's the probability that this

45:00

variable, changing this variable actually causes this other variable to change? In the case of ads, where you could see where it's going to unlock a lot of value for companies where, you know, there might be a lot of investment in this, is what is the probability that this ad affects someone's likelihood to visit my place or to buy something from me more generally? Or what is my probability distribution over that? And so can I estimate that? And I think that that whole industry

45:31

of online ads is, it's very frustrating for an engineer because it's so inefficient. And there's so many people in there that don't know what they're doing. And it could be very frustrating at times. But I think that means also that there's a lot of opportunity to like unlock value if you have a lot of patience. Sure. Well, so much of it is just they looked for this keyword, so they must be interested, right? It doesn't take very much into account. Yeah, but the question is, okay, maybe they

45:56

look for that keyword and now they're going to buy it no matter what I do. So don't send them the ad, send the ad to someone who didn't search the keyword. Or maybe they need that extra push and that extra push is very valuable. It's hard to know unless you measure it. And you measure it, you don't get a whole lot of data. So you really, it really has to be a Bayesian model. Whoever uses these Bayesian

46:17

models is going to get way ahead. But right now it goes through several layers. I kept saying when we were working on this problem and people weren't getting what we were doing, I was like, I wish the people who are writing the check for these ads could get in touch with us because I know they care. But, you know, oftentimes you're working through sales and someone on the other side. It was just too many layers between, right? Yeah. Yeah, for sure.

46:42

Earlier, you spoke about having your code go fast and you talked about Cython. Oh yeah. What's your experience with Cython? I used that for the multi-logistic regression. And all I can say is it took a little getting used to, but, you know, I got an order of magnitude speed up, which we needed to launch that thing in our one-off Python job at Foursquare. So it took only a few hours versus all day. So it was kind

47:12

of a helpful tool to get that thing launched. And I haven't used it too much since, but I kind of keep that in the back of my mind as a part of my toolkit. Yeah. It's great to have in the toolkit. I feel like it doesn't get that much love, but I know people talk about Python speed and, oh, it's fast here. It's slow there. Yeah. First people just think it's slow because it's not compiled, but then you're like, oh,

47:34

but what about the C extensions. You go, actually, yeah, that's actually faster than Java or something like that. So interesting. Yeah. I've also had a big speed up just by taking, you know, a dictionary or matrix I was using and then using NumPy instead of the, or NumPy, I don't know how you pronounce it, but instead of using- I go with NumPy, but yeah. Okay. NumPy instead of the standard, like, you know, Python tools, you could also get a big speed up there.
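
A small sketch of the kind of swap being described: the same sum of products written as a pure Python loop and as a single NumPy call. The data is made up, and how much faster the NumPy version runs depends on the machine and the array size, so no timings are claimed here.

```python
import numpy as np


def dot_pure_python(weights, features):
    """Sum of element-wise products using an ordinary Python loop."""
    total = 0.0
    for w, x in zip(weights, features):
        total += w * x
    return total


# Made-up data: 100,000 weights and features.
weights = [0.5] * 100_000
features = [2.0] * 100_000

slow = dot_pure_python(weights, features)

# The same computation with the loop pushed down into NumPy's compiled code.
w_arr = np.array(weights)
x_arr = np.array(features)
fast = float(np.dot(w_arr, x_arr))

print(slow, fast)
```

To compare the two on a given machine, the standard library's `timeit` module can time each call. Converting the lists to arrays has its own cost, so the benefit is largest when the data stays in arrays across many operations.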

48:01

Yeah, for sure. And that's pushing it down into the C layer, right? Yeah. But a lot of times you have your algorithm in Python, and one option is to go write that C layer because you're like, well, we kind of need it. So here we go down the rabbit hole of writing C code instead of Python. But Cython is sweet, right? Especially the latest one, you can just put the regular type annotations, the Python 3 type annotations. Oh, yeah. On the types. And then, you know, magic happens.

48:24

I definitely, I just started with Python and it was like, you know, we're in this, these three functions 90% of the time, just fix that. It's usually the slow part is like really focused. Most of your code, it doesn't even matter what happens to it, right? It's just, there's like that little bit where you loop around a lot and that matters. Yeah. Yeah. It's funny how we over optimize and you can't escape it. Like even when I'm creating,

48:46

you know, I see like a bunch of doubles. I'm like, oh, but these are only one and zero. Can we like change them to Boolean? But like in the end, it doesn't care. It doesn't matter. For most of the code, it really has no effect. For sure. Except in that one targeted place. Yeah. So the trick is to use the tools to find it, right? Yeah.
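
One such tool ships with Python: the `cProfile` module in the standard library. The sketch below profiles a toy program to show where the time goes; the functions `cheap_setup`, `hot_loop`, and `main` are invented for the example.

```python
import cProfile
import io
import pstats


def cheap_setup():
    """Stand-in for the bulk of the code that barely matters for speed."""
    return list(range(100))


def hot_loop(n=200_000):
    """Stand-in for the small part of the program that loops a lot."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def main():
    cheap_setup()
    return hot_loop()


# Record timing information while main() runs.
profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

# Format the results, sorted by cumulative time spent in each function.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats()
report = buffer.getvalue()
print(report)
```

In the printed table, the loop-heavy function should account for most of the time, which is the place to consider a better algorithm, NumPy, or Cython.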

49:02

Like cProfile or something like that. The other major thing, you know, one thing you can do to speed up stuff like this, these algorithms is just to say, well, I wrote it. I wrote it in Python or I use this data structure and maybe if I rewrote it differently or I wrote it in C or I applied Cython, it'll go faster. But it could be that you're speeding up the execution of a bad algorithm. And if you had a better algorithm, it might go a hundred times faster

49:28

or something, right? Like, so how do you think about that with your problems? That's what I did for the, back in 2014 with the Dirichlet prior calculator. And that was an interesting problem to solve because to recap on that, it's one of the use cases we had. Okay. What's my prior on a venue before I've gotten any reviews? What's my prior on a restaurant before I've gotten any reviews? And I'm using the experience of the data on all the other restaurants

49:53

I've seen. So we know what the variance is. And let me try to come up with an equation that can calculate that value from the data. And it turned out there were some algorithms available, but as I dug into the math, I noticed that there was like a math trick that I could make use of. In other words, it was something like certain logs were being taken of the same number, were being taken over and over again. And it's like, okay, just store how many times we took the

50:23

log. And then when I dug into the math, they kind of combined into one term and multiply that together. So essentially I used a bunch of factoring and refactoring, whether you think of it as factoring code or factoring math to get kind of an exponential speed up in that algorithm. And so that's why I published a paper on it. I was very proud of that. It was a, it was very satisfying thing to do. It might not have mattered in terms of our product, but I think a lot of people used it though,

50:49

to be like, I want rather than just taking an average of what I've seen in the past. No, I want to do something that is based on good principles. And so I want to use the Dirichlet prior calculator. And so some people have used that. It's my Python code online. And the algorithm has proven very fast and like almost instantaneous. Basically, as soon as you load all the data in,

51:13

it gives you the answer, which I like. Now, my next step to that is to use PyMC3, rather than giving you an answer, it should give you a probability distribution over answers. Yeah, that's right. I haven't done that yet. Didn't know about that. Yeah. Didn't know about that at the time. I think my speed up would still apply.
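
The exact algorithm from the paper is not spelled out in the conversation, so the following is only a guess at the flavor of the trick: in a Dirichlet-multinomial log-likelihood the same log terms recur across many rows of data, so they can be counted once up front and then reused. The data layout (one row of `[likes, dislikes]` per venue), the numbers, and all function names are assumptions for illustration.

```python
import math
from collections import Counter


def loglik_naive(rows, alphas):
    """Dirichlet-multinomial log-likelihood, dropping the multinomial
    coefficient (which does not depend on alphas). One log per term."""
    a_sum = sum(alphas)
    total = 0.0
    for row in rows:
        for k, n in enumerate(row):
            total += sum(math.log(alphas[k] + j) for j in range(n))
        total -= sum(math.log(a_sum + j) for j in range(sum(row)))
    return total


def precompute(rows):
    """Count how many rows need each distinct log term. Done once."""
    per_cat = [Counter() for _ in rows[0]]  # per_cat[k][j]: rows whose count k exceeds j
    overall = Counter()                     # overall[j]: rows whose total exceeds j
    for row in rows:
        for k, n in enumerate(row):
            for j in range(n):
                per_cat[k][j] += 1
        for j in range(sum(row)):
            overall[j] += 1
    return per_cat, overall


def loglik_fast(per_cat, overall, alphas):
    """Same value as loglik_naive, but each distinct log is taken once
    and multiplied by the number of times it occurs."""
    a_sum = sum(alphas)
    total = 0.0
    for k, counts in enumerate(per_cat):
        total += sum(c * math.log(alphas[k] + j) for j, c in counts.items())
    total -= sum(c * math.log(a_sum + j) for j, c in overall.items())
    return total


rows = [[3, 1], [5, 0], [2, 2], [4, 1], [0, 1]]  # made-up [likes, dislikes] per venue
alphas = [1.5, 0.7]                              # candidate prior parameters

per_cat, overall = precompute(rows)
print(loglik_naive(rows, alphas), loglik_fast(per_cat, overall, alphas))
```

An optimizer evaluates the likelihood for many candidate `alphas`. With the counts stored, each evaluation costs roughly the number of distinct count values rather than the size of the whole data set.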

51:28

Yeah, that's cool. Well, that definitely takes it up a notch. What about learning more about Bayesian analysis and inference and like, where should people go for more resources? Oh, okay. Well, a kind of a history book that I read that I really like on Bayesian inference is one called The Theory That Would Not Die by Sharon Bertsch McGrayne, a few years old, but it's really good if you're interested in the history on that. I have a book about PyMC3, kind of a tech book that does go

51:56

into the basics of Bayesian inference that has a really good title. It's called Bayesian Analysis with Python. Oh, yeah. Yeah, yeah. So that's a good one to look at. And then I have a bunch of episodes on my show that are related to Bayesian analysis. So episode zero and one on my show were basically just starting out trying to describe Bayes' rule to everyone. I sort of attempted to do the description in episode

52:24

zero. And then in episode one, I applied it to the news story that was happening that day, which was kind of the fire alarm at the bigger scale, which was everyone in Hawaii getting this message that there's an ICBM missile coming their way because of a mistake someone made. And then- Yeah, because of some terrible UI decision on like the tooling. Yeah, is that what it was? Yeah, yeah.

52:47

Yeah. There was some analysis about what had happened and not probabilistically, but there was some, there's some really old crummy UI and they have to press some button to like acknowledge a test. Or treat it as real and somehow they look like almost identical or there's some weird thing about the UI that had like tricked the operator into saying, oh, it's real. Yeah, yeah. And then another couple episodes I want to highlight is episode 21 and 22,

53:13

which is sort of kind of 21 is the philosophy of probability. In 22, we talk about the problem of p-hacking, which is when people try their experiments over and over and until they get something that works with p-values, which is a frequentist idea, which works if you're using it properly. But the problem is most people don't. And then we did an episode, I think it

53:33

was 65 on probability, how to estimate the probability of something that's never happened. And then 78, the one that you mentioned, which was on the history of Bayes and a little more philosophy. So I've talked about that a lot. You could probably go to localmaxradio.com or localmaxradio.com slash archive and find the ones that you want.

53:52

That's really cool. So yeah, I guess we'll leave it there for now. That's quite interesting. And yeah, it gives us a look into some of the algorithms and math we got to know for our data science. Now, before you get out of here, though, I got the two questions I always ask everyone. You're going to write some Python code. What editor do you use? I just use Sublime or TextMate also on Mac. But I'm sure I could do something a little better than that. I just picked one and never really looked back.

54:19

Sounds good. And then notable PyPI package? Notable. Maybe not the most popular, but like, oh, you should totally know about this. I mean, you already threw out there PyMC3, if you want to claim that one, or if there's something else. Yeah, pick that. Yeah. Well, I have BayesPy, which is the one that's like in GitHub slash max slash BayesPy,

54:40

which has all the stuff I talked about. It's not actively developed, but it does have my kind of one-off algorithms, which if you're in the market for multinomial models or Dirichlet, or you want some kind of interesting new way to do multi-logistic regression, you could certainly give that a try. But most people probably want to use kind of the standard toolings. Yeah. Why don't I go with that? Why don't I go with the one I wrote a long time ago?

55:09

Yeah. Right on. Sounds good. All right. Final call to action. People are excited about this stuff. What do you tell them? What do they do? Check out the books I mentioned and check out my website, localmaxradio.com. And also subscribe to the Local Maximum. It should be on all of your podcatchers. If it's not on one, please let me know. But it should be on all of your podcatchers. localmaxradio.com. It's just every week. And we have a lot of fun. So definitely check it out.

55:35

Yeah, it's cool. You spend a lot of time talking about these types of things. Super. All right. Well, Max, thanks for being on the show. Michael, thank you so much. I really enjoy this conversation. Yeah, same here. Bye-bye. Bye. This has been another episode of Talk Python To Me. Our guest on this episode was Max Sklar, and it's been brought to you by Linode and Tidelift. Linode is your go-to hosting for whatever you're

55:56

building with Python. Get four months free at talkpython.fm/linode. That's L-I-N-O-D-E. If you run an open source project, Tidelift wants to help you get paid for keeping it going strong. Just visit talkpython.fm/Tidelift, search for your package, and get started today. Want to level up your Python? If you're just getting started, try my Python Jumpstart by

56:19

Building 10 Apps course. Or if you're looking for something more advanced, check out our new async course that digs into all the different types of async programming you can do in Python. And of course, if you're interested in more than one of these, be sure to check out our Everything Bundle. It's like a subscription that never expires. Be sure to subscribe to the show. Open your favorite podcatcher and search for Python. We should be right at the top.

56:41

You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm. This is your host, Michael Kennedy. Thanks so much for listening. I really appreciate it. Now get out there and write some Python code. We'll see you next time.

Transcript source: Provided by creator in RSS feed