#276 Tracking cyber intruders with Jupyter and Python

00:00

Hello and welcome to Python Bytes, where we deliver Python news and headlines directly to your earbuds. This is episode 276, recorded March 22nd, 2022. So many twos. I'm Michael Kennedy. And I'm Brian Okken. And I'm Ian Hellen. Hey, Ian. Welcome to the show. It's great to have you here. Thank you very much. I've listened to the show a lot and feel very privileged to appear on it. It's our privilege to have you here. Thank you so much for listening. And I know you got some

00:29

cool stuff to share. So we're looking forward to hearing about that. Also, I do want to say thank you to FusionAuth for sponsoring the show. I'll tell you more about them later. Before we get into the topics, Ian, tell people a quick bit about yourself. Sure. I'm a developer at Microsoft, in the Microsoft Threat Intelligence Center. Been with Microsoft for quite a long time. Only relatively recently, like four years or so ago,

00:52

got into Python coding with Jupyter notebooks. So I work on Jupyter notebooks for the Microsoft Sentinel project and own a modest open source package called MSTICPy, which we'll cover a little bit later. Takes most of my time. Fantastic. The whole cybersecurity threat detection stuff, it's very interesting. There's a lot of innovation there, but it's also a challenging area to be working in. Yep. Yep. We're never short of stuff to do.

01:20

Certainly. I'm sure you're not. Well, Brian, how about you kick us off here? Well, so I'm going to start off with a problem. So I had a problem and I have a cool solution for it. So my problem is, on Test & Code, I've got episode titles, and each show is an MP3 file, but I want to create automated show notes, or not show notes, a transcript. There are a lot of problems in trying to automate this, but one of them

01:50

is the title. It's something like, you know, normal English with capitalization and all sorts of spaces and stuff. I want to turn that into things that URLs hate. Yeah. I want to turn that into a URL. And one of the things is getting rid of stop words. So there's a bunch of stuff like lowercasing. I can do that easily, but getting rid of stop words was a little hard. So I ran across this,

02:19

this thing called the Gensim parsing preprocessing module, `gensim.parsing.preprocessing`. So Gensim is a larger sort of beast. It's used for machine learning and stuff, to generate models. But I'm just really using one little piece of it, the preprocessing part. And it's really pretty cool. I actually found this article first. There was an article called removing stop

02:47

words from strings in Python. And it has a discussion of NLTK and Gensim and spaCy. I tried all of them out actually. And the one that really stuck best for me, it talked about using `remove_stopwords`, which is exactly what I wanted, right from Gensim. So I went ahead and tried that and it worked really well, but I'm like, wait, I'm pulling this in from the preprocessing library.

03:15

I wonder what else is in there. And there's all sorts of really cool stuff in here. There's `lower_to_unicode`. It turns it both into lowercase and into Unicode. That's pretty neat. Don't think I need it, but that's neat. But then there was one that I thought maybe is exactly what I want, something called `preprocess_string`. And it has a whole bunch of filters built into it. Oh, nice. Like strip. Yeah. Strip whitespace, strip punctuation. I love it.

03:45

Yeah. And take away multiple whitespaces. After it strips punctuation, if I go back, I had a slash in my title for one of the episodes. If it takes that out, I'm going to have a space before and a space after. So I want to remove those. So it'll strip multiple whitespaces. It strips out numerics, because I probably don't want numbers in there. And then remove stop words. The one thing I don't want, so I'll have to customize how I'm calling this, is `stem_text`.
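
As a rough illustration of the filter chain described here, the following is a standard-library sketch, not Gensim's implementation. The functions are named after the Gensim filters mentioned in the episode but are reimplemented from scratch, `STOP_WORDS` is a tiny hypothetical stand-in for a real stop-word list, and `title_to_slug` is a made-up helper name. With Gensim itself, `preprocess_string` accepts a `filters` argument, so a step such as `stem_text` can simply be left out of the list.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list (real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "of", "with", "for", "to", "in", "on"}


def strip_punctuation(text: str) -> str:
    # Replace each punctuation character with a space, so "a/b" becomes "a b".
    return re.sub(f"[{re.escape(string.punctuation)}]", " ", text)


def strip_numeric(text: str) -> str:
    # Drop runs of digits entirely.
    return re.sub(r"[0-9]+", "", text)


def strip_multiple_whitespaces(text: str) -> str:
    # Collapse whitespace runs left behind by the earlier steps.
    return re.sub(r"\s+", " ", text).strip()


def remove_stop_words(text: str) -> str:
    return " ".join(word for word in text.split() if word not in STOP_WORDS)


def title_to_slug(title: str) -> str:
    text = title.lower()
    for step in (strip_punctuation, strip_numeric,
                 strip_multiple_whitespaces, remove_stop_words):
        text = step(text)
    return text.replace(" ", "-")


slug = title_to_slug("Tracking cyber intruders with Jupyter and Python")
print(slug)
```

The order matters: whitespace is collapsed only after punctuation and digits are removed, which handles the "space before and a space after" case from the slash example.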

04:13

So `stem_text`, I didn't know what that did without playing with it, but what it does is it would take things like "twisted" and turn it into "twist". That's really not right. So you definitely don't want that. I don't want that. I don't want it to mess it up, but I think I want

04:26

everything else. So this Gensim library, you know, if you're doing machine learning, coming up with models, I think this is a great tool to look into. But I'm actually going to use it just to create these titles for, you know, my podcast. I think it feels a little weird. It feels like I'm using this really big hammer to do this little tiny problem. I guess I'm okay with it, but you know, do you have any other ideas

04:57

where it could use or, well, I didn't know about this. So I wrote my own. Okay. And it's, it's, it's kind of janky. Like it's a little bit, a little bit recursive iterative. It's like, we'll take away all the punctuation. Now turn all of your white spaces into single white spaces. Cause there might've been, you know, dot space. So now you've got two white spaces, but you've got to take away, you know, there's like a bunch of weird steps and then, then put it back. This looks

05:22

cleaner. It is a dependency, but it does look cleaner. I like this. I think it's, I'm glad I know about it. Ian, what do you think? Is it a huge thing? I mean, dependency, but, I always think of like ML like stuff, but this is like just the pre-processing, right? Well, I'm actually pulling in all of GenSim to get this. I don't know if I can pull in little bits, but, it's, it's not

05:42

really part of my application that I'm shipping. It's just a tool that I'm using on my laptop. So I guess downloading it once doesn't really bother me too much, even if it's a big thing. But cool. Yeah. I was thinking, yeah, that's a good point. If it's running local, it's like a dev dependency, who cares? Right. It's like worrying about how big pytest is. Like it doesn't really matter. And I'm not, well, I kind of care about that. Cause

06:05

CI is going to pull it in all the time for pytest, but. Yeah, but they've got fast networks. It's not your bandwidth. It'll be all right. One of the things that struck me about this that made me think of your situation is that `lower_to_unicode`. So many times in the security space, it's about like, you're checking for this representation, but what if there's another representation that means the same thing? Like you don't say go to this

06:31

directory. You say go dot, dot. And then over there, you know, those kinds of non-canonical representations. I wonder if there's any use of this kind of stuff for you. Yeah. There's something I kind of touch on in the Pygments section later on, which is that attackers typically write scripted attacks and try to obfuscate code using a mix of uppercase and putting in random dots. I'm just thinking that'd potentially be a nice way of cleaning some of that stuff up.
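
The "go dot, dot" point can be sketched with the standard library: several spellings of one path collapse to a single canonical form before any comparison is made. The paths and the `canonical` helper are made up for illustration, and folding case is only appropriate where the target system treats paths case-insensitively, which is an assumption here.

```python
import posixpath


def canonical(path: str) -> str:
    # Collapse "." and ".." segments and repeated slashes, then fold case.
    return posixpath.normpath(path).casefold()


variants = [
    "/var/www/uploads/../../log/auth.log",
    "/var//log/./auth.log",
    "/VAR/Log/AUTH.LOG",
]

# All three spellings reduce to the same canonical string.
unique = {canonical(p) for p in variants}
print(unique)
```

A check written against the canonical form then only has to know one representation instead of every variation an attacker might try.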

06:56

Yeah, for sure. There's been some interesting supply chain vulnerability stuff. Remember the guy with the colors and, I think, the faker stuff in JavaScript, who sabotaged his libraries? There was another one that was maybe well-intentioned. I don't know. It was some open source library. I don't believe it was Python. I can't remember what it was. It could have been, but I'm pretty sure it was in JavaScript because that's where all,

07:22

most of the bad stuff was, it seems. Anyway, they wrote their, they, they taught their dependency to erase everybody's hard drive who installed it, who was in Belarus and Russia, which, okay, maybe they're trying to contribute, but like it ended up doing a bunch of bad things, even to places that were like trying to help say people in the press and journalists do certain things and then like,

07:46

you know, connect with sources, and it erased, like, that database as well. And what they did to make it so that nobody would notice in the GitHub commit before it went out to npm was Base64 encode their changes. So they basically put in a Base64-encoded string and then, like, decode and then run that. And, you know, it's like that kind of stuff. I know this won't solve that problem, but yeah, you know, that sort of category of like weird representations.
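
One way to picture the "decode and then run" trick is to scan source text for long Base64-looking strings and show what they decode to. This is a minimal sketch under stated assumptions: the regular expression, the 16-character threshold, and the `decode_candidates` helper are all illustrative choices, and the payload is a harmless made-up string.

```python
import base64
import binascii
import re

# Runs of 16 or more Base64 alphabet characters, optionally padded.
B64_CANDIDATE = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def decode_candidates(source: str) -> list[str]:
    decoded = []
    for match in B64_CANDIDATE.finditer(source):
        try:
            raw = base64.b64decode(match.group(), validate=True)
            decoded.append(raw.decode("utf-8"))
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid Base64, or not text: ignore it
    return decoded


# Build a sample line that hides a harmless statement inside a string literal.
hidden = base64.b64encode(b"print('hidden payload')").decode("ascii")
sample = f'run(decode("{hidden}"))'

print(decode_candidates(sample))
```

A reviewer reading the commit would see only the encoded literal; decoding it first makes the hidden statement visible.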

08:10

Yeah. You need MSTICPy for something like that. It's one of the things we, yeah, it's a common thing, kind of Base64 decoding before deobfuscating. But yeah. Yeah. Interesting. Yeah, I thought of maybe using something like that, because one of the problems we have is like every script is kind of slightly different. If you could use something like that, essentially kind of apply like sentiment analysis to

08:34

script. I mean, this is a big problem. It's just not something I've particularly solved. but that might be a kind of useful, useful thing to just picking out certain things that indicates malicious, like format, you know, format drive. Exactly. Yeah. You could certainly represent like this one does hard drive stuff. Is this, I thought it was parsing colors. Why is it doing things with the hard drive? This is odd,

08:55

you know, like, or with the network, stuff like that. Cool. All right. Well, you know what you would really want to check out if you were trying to research these things, probably documentation. So I want to tell you all about DevDocs, devdocs.io. This is pretty cool. Now, when you get there, it's interesting: on my Firefox, it's just got like the mobile view, which is really odd. If you go there with what it believes is a full browser, I guess it's like a slightly

09:20

different view. That's pretty similar, but not the same. So if you open it up, there's a whole bunch of programming technologies, let's say not just Python or JavaScript or something, but there's also Vue.js. There's Werkzeug, for example, like some of the foundation of Flask, and you can pick the particular versions and stuff. So you can go in and like enable these different things. So maybe I care about

09:42

Vue. I can go over here and enable that one. We definitely want some Python. Let me go find some Python, and it gives you all the versions. I'll take that. And let's say I'm also working with Postgres. So I'll enable that documentation. And then I might be working with nginx for the front end, which is somewhere right here. So you can go enable that. And then it will be up near the top somewhere here. You can see these are either the default ones or the ones that I checked on. So then

10:09

you can open them up and say, I want to go and see the nginx guide about a debugging log. And then it takes you to the documentation for that technology. So it's like a meta documentation repository for all of these things all at once, which is pretty cool. Right? So I can go up here and search. I want to know about like, let's go about like media tags or something. So you can see the stuff in HTML5. You

10:33

can see the stuff in when you say media, it looks like median. So you can see that in the statistics module for Python, some stuff for CSS, or you could come over and say, look, I just want to search for CSS. And then you get like using media queries and how to do that kind of stuff. So it's kind of a, what you do is you turn on the pieces that are relevant to you, and then you can search across

10:54

those technologies. Cool, right? Wow. Yeah. And then if you're on the move, you can come over here and turn on offline data, and it'll download all of that as an app, so that when you're at the coffee shop or you're on a plane, you now have all the documentation for Python 3.10, Vue.js, Werkzeug, nginx, et cetera, et cetera, that you can use, which is pretty cool. And this is something that drives me crazy about Firefox. They had it and they took it away. And I don't understand why,

11:24

because I feel like Firefox is all about the web. So they took away the ability to do progressive web apps in Firefox, but all the Chromium browsers support it. So you can actually go and install this as a dedicated application on your system. So if you have no web, you just click that open. It's its own window. You can, you know, alt-tab, command-tab between it. Super easy. And then turn on the

11:47

offline mode. And you basically have an app that has offline documentation for all the programming technologies that you care about. So this is my new coffee shop buddy. Does the search go across the things you've selected then? So if I search for like replace or something, it's the things I've selected. Yeah. So if you turn on like JavaScript and Python, it would look for that in both languages. Oh, okay.

12:08

Yeah. So basically the ones you turn on, there's a ton of them, right? And you pick, you say, these are interesting to me and then search and stuff from what I can tell only applies to the technologies you say you care about. Cause like if, if you don't use Java, you really don't want to see the documentation for Java search, right? That would be useless.

12:23

Yeah. One of the things I like about this is it also has versions. So if you're using, like, an older version of Postgres, you can just enable that version. Right. Sometimes it doesn't matter very much, but other times it matters massively, like Bootstrap 3 and Bootstrap 5, they're like fully incompatible basically. Like they're totally different keywords and grid systems. And you don't want just the latest. If you've got an old app

12:47

you're working on, something like that. Python's more forgiving about that kind of stuff, right? It doesn't break as often. I was amused that the list, though, has like 3.9, 3.8 for Python and it has 3.10 at the bottom, because 1 is obviously... Cause it's alphabetically sorted. How interesting. Ian, what do you think of this?
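
The ordering guessed at here is what a plain string sort produces. A small sketch, with a made-up version list and a hypothetical `version_key` helper, shows the difference between sorting version numbers as text and as tuples of integers; whether DevDocs actually sorts this way is only the guess made in the conversation.

```python
versions = ["3.8", "3.9", "3.10", "3.7"]

# Plain string sort, newest "first": "3.10" lands at the bottom because the
# comparison is character by character and "1" sorts before "7".
as_text = sorted(versions, reverse=True)


def version_key(version: str) -> tuple[int, ...]:
    # Compare each dotted component as a number instead of as text.
    return tuple(int(part) for part in version.split("."))


as_numbers = sorted(versions, key=version_key, reverse=True)

print(as_text)     # ['3.9', '3.8', '3.7', '3.10']
print(as_numbers)  # ['3.10', '3.9', '3.8', '3.7']
```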

13:07

That's very cool. I'm amazed. Is somebody at DevDocs kind of manually maintaining all of the links to these, like the original source documentation? Yeah. Where are they getting it from? Right. I mean, cause they're super disparate. It's like Matplotlib and Markdown and MariaDB. It's unlikely they're all stored in the same basic system. Right. I don't know how they get them actually.

13:29

Yeah. That's very cool. I mean, I know, I normally have solved the same problem by having like 130 tabs open to different bits of Python docs and pandas and. Exactly. Exactly. Yeah. I'm pretty sure they got pandas in here. They got numpy as its own thing that we saw matplotlib. There's pandas and there's even, you know, versions of pandas across there. Single tab solution. Brilliant. Yeah. It looks, looks pretty good to me. All right. You want to tell us about what you got for your first item?

13:58

Okay. Sure. Yeah. So, as I mentioned earlier, I own a package called MSTICPy. And the first thing to sort out with it is the spelling, because I suffer from this on a daily basis, mistyping it, even though I've owned it for like three or four years. So MSTIC stands for Microsoft Threat Intelligence Center. There's no Y or anything like that in there. So it's a tool set for cybersecurity investigations and hunting in Python, mainly in Jupyter notebooks. So there are a

14:29

couple of questions to ask about that. So firstly, what is cybersecurity hunting and investigation, and why are Jupyter notebooks useful? So the first one: cybersec investigation is really responding to alerts or other kinds of threat intelligence and trawling through typically large amounts of security logs from cloud services, hosts, account services to determine whether this is a real threat

14:53

or not. And there are two main kinds of... That's one of the huge problems, right? You've got all these different systems. If you don't have a tool like this, how are you going to know that someone's in there rooting around, right? Yeah. Yeah. And there are a couple of things that usually trigger this kind of search. So one of them is an alert, maybe coming from your SIEM, and that stands for security information and

15:17

event management. So that's like a console: ArcSight is a traditional one, or Microsoft Sentinel is a cloud-based one. So you get an alert based on a rule, and you need to go through a fairly managed process.

15:30

Somebody needs to go and investigate. Is this a real threat or is this just noise? Or there might be something like SolarWinds over a year ago, or Log4j, like something in the press, or something from a threat intel kind of alert that says this kind of threat is around, and that's a more ad hoc process, kind of hunting. Like, do we see this in our organization? So that's kind of what MSTICPy is trying to, you know,

15:55

try to address the needs of that. And the second question is why Jupyter notebooks? Why would you do this in a Jupyter notebook rather than in your existing SOC tools? I mean, I think this kind of

16:08

activity has a lot in common with, like, big science data, sorry, big data science. I mean, something like astronomy, where hunting for adversary activity is a little bit like trying to find an exoplanet in gigabytes of data, or a new quasar or something like that. A hundred thousand stars or a hundred thousand lines of log file, and you're hunting for some patterns and stuff.

16:31

Right. And you've got a few photons and you're trying to determine, are these kind of different? You know, adversary activity is a little bit like that. It's like millions and millions of events and you're trying to find the bad stuff. So traditional SOC tools can be really excellent. And I work with one that I think is really good, but they all have limitations.

16:50

What's a SOC tool? A SOC is a security operations center. So something like, you know, a console that fires alerts, and they have a bunch of analysts and engineers looking at the output of this and deciding, and that's the trigger for their investigations. Is it like a failed login on the SQL server?

17:10

Yeah. Something like that. Or, you know, it could be more sophisticated thing. Like, something's exit, you know, tried to access the kind of password data on this, or looks like it's trying to access the password data on this host or, or has made a weird kind of configuration change to, mailbox settings. So all those kinds of things can kind of trigger alerts and investigations. but you are limited in most kind of operation center environments. Notebooks allow you to kind of break out of some

17:39

of the constraints of that. So firstly, you can get data from anywhere. You're not just limited by what's in your logs. You could go to VirusTotal, so you can bring data from anywhere. You can use customized analysis, so write your own or get things from PyPI. Lots of people have written this stuff. You control the workflow, so you don't have to follow what the tool says. You can reorder things, you can backtrack, redo things, and the workflow is repeatable.

18:08

So if you get a similar kind of, you know, issue again, or similar kind of alert, you can fish out an old notebook and rerun the same kind of analysis. And you end up with a nice kind of shareable document that, it describes your investigation a bit like the results of a scientific investigation. It's like, here are all the steps I took and these are the results. And this is what they, this is what we determined to be the bad, you know, the bad activity.

18:33

Right. The other thing that seems useful here is Jupyter. Often the notebooks will save the last bit of computed information. And then you can go, you know, change a cell, ask the question again, change without rerunning the whole thing. And like that's parsing tons of logs or pulling them over SSH or whatever that not doing that again is nice. Yeah. And it's brilliant. If you don't like doing lots of queries in different browser tabs and your browser crashes, they've all gone. What do you do?

19:01

It's all in a Jupyter notebook. I say, it's like second by second, after you do it, you can just go back, and you can go back to things you may have done months ago. So, yeah, absolutely. Yeah. So when I started all of this, I kind of thought a lot of this stuff for cyber investigations would be available on PyPI. I thought, great, Jupyter notebooks seem brilliant. And there's going to be a process tree viewer and there's going to be an event timeline and all this

19:25

kind of stuff. And I found out there wasn't, at least I couldn't find it. So I decided to just, like, stop everything. Need to start writing this stuff. So it turns out that things like the visualizations you need for detecting exoplanets are a bit different from the ones you need to detect bad actors. So we started building this thing, originally me, but there's now Pete Bryan and Ashwin Patil also working on it, two of my colleagues, and a bunch of people in the

19:55

community. It's got four main functional sections. There's data querying: how you get data in, how you do templated queries. There's enrichment. So for example, if you have something like an IP address, you might have a bunch of questions about it as an analyst, like which geographical location is this IP address from, or are there any malware reports about it. The third area is analysis, things like anomaly identification, like the thing you were talking about, a spike in failed logon events,

20:25

unusual spike in failed logon events, that kind of thing. The final area is visualizations, and these are like more specialized. I've got a couple of examples in the show notes. This is like an anomaly identification pattern. This is one of the custom ones. We use Bokeh, which is a really nice visualization package, to allow you to view data in a way that an analyst kind of expects to see it. So there's more of this

20:51

kind of visualization than more traditional kind of graphs. I would much rather look at this than log files or event logs or, or whatever, you know? Yeah. That's the whole thing about, you know, you, you, you need, you may have thousands of events and you need to get down to the few that are the interesting, the interesting thing. so one of the areas that we've, we try to focus on currently, cause we wrote all this stuff and you have like hundreds of functions that you could use,

21:15

but it's kind of difficult to discover them. And they all, cause they evolved a little bit organically. Like how do you, they were working a little bit of a different way, different set of parameters. So the work we're currently doing is trying to make this all a bit more accessible. So all of the functions that relate to say an IP address, all the questions you want to ask about it are kind of dynamically attached to a class called IP address. So they're all like things like,

21:41

Oh, interesting. Do, do, do. So you don't have to work just with a raw string or just some raw IP representation, but you can ask it questions like its location. Well, it's not quite that intelligent. It's even a bit less intelligent than Alexa, but, but it's, but it's more like, you know, there might be things like geolocation of an IP address, threat intel lookups, different queries that might be, have IP addresses like a, a parameter.

22:07

And previously you'd have to go and find all of these things and import them separately and run them. But now they're all kind of dynamically attached as methods. The fact that they use an IP address as a parameter means that you just have one object to import, and then you can do all of these different operations on this single item. There are some things that don't work with that. Some things like the visualizations, for example, they're not IP address or host or account specific.
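
The idea of attaching stand-alone functions to an entity class can be sketched in a few lines. Everything below is hypothetical and is not MSTICPy's actual implementation: the `IpAddress` class, the `attach` helper, and the two example functions are made up, and the standard-library `ipaddress` module stands in for real enrichment lookups.

```python
import ipaddress


class IpAddress:
    """Hypothetical entity class; not MSTICPy's actual implementation."""

    def __init__(self, address: str):
        self.address = address


# Stand-alone functions that each take an IP address string as a parameter.
def is_private(address: str) -> bool:
    return ipaddress.ip_address(address).is_private


def ip_version(address: str) -> int:
    return ipaddress.ip_address(address).version


def attach(entity_cls, func):
    # Wrap the function so it reads the address from the entity instance,
    # then expose it as a method under the function's own name.
    def method(self, *args, **kwargs):
        return func(self.address, *args, **kwargs)

    method.__name__ = func.__name__
    method.__doc__ = func.__doc__
    setattr(entity_cls, func.__name__, method)


for func in (is_private, ip_version):
    attach(IpAddress, func)

ip = IpAddress("192.168.1.10")
print(ip.is_private(), ip.ip_version())
```

With this shape, an analyst imports one class and discovers the available operations through tab completion on the instance.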

22:33

They work on big blocks of data. So the other area we're working on is anything that takes a bunch of data as an input. We're writing those as pandas accessors, so they appear as methods on a DataFrame. So you do kind of `df.mp_plot.timeline()`, right? And it would produce your timeline, as long as it's the right kind of data. So yeah, that's one of the challenges of writing this kind of thing organically: you end up with a lot of stuff, but nobody knows it's there

23:02

and nobody knows how to import it. So we're trying to make it as accessible as possible so that it just becomes a very intuitive thing. Oh, I have an IP address. What functions can I do? I can do this, you know, it's all like tab completable, that kind of thing. Yeah, I think it's really cool. You've taken this Python data stack view of cybersecurity and threat detection. Yeah. Yeah. Brian, what do you think?

23:23

Well, it's definitely a complicated area. And one of the things I like about this story is just talking about the complexities in API design and discoverability. That applies to lots of different fields, but yeah. Yeah. It's one of those things you should have thought about at the beginning, but even at the end, you can be tidying things up. Yeah. So. Famous last words.

23:49

So yeah, we're definitely open for like other people collaborating, contributing stuff, cause there's a lot of ground to cover. yeah, for sure. It's on GitHub. I saw one final question before we move on. Is it just for Azure or is, is this a thing that more broadly works across different systems? No, I think I should have mentioned that a little bit earlier on it. We originally built it for Microsoft Sentinel notebooks, but it supports like Splunk, Defender,

24:16

working on an Elastic provider. So really anything you can get into a pandas DataFrame, you can use most of the functionality. So even if we don't have a provider ourselves, if you've got something like PySpark and you can get a DataFrame, then all of our functions take DataFrames. You know, we use pandas as our universal data interchange format. Yeah, indeed. Indeed. Kim van Wyk out in the audience likes it. It's way, like, a much nicer way

24:44

to glean info than logs and complex grep. I'm right there with you. All right. Now, before we move on, Brian, let me tell you about our sponsor for this episode. This episode of Python Bytes is brought to you by FusionAuth. FusionAuth is an authentication and authorization platform built by devs for devs. It solves the problem of building essential user security without adding risk or distracting from the primary application. FusionAuth has all the features you need with great support and

25:12

a price that won't break the bank. And you can either self-host it or get the fully managed solution hosted in any AWS region. Do you have a side project that needs custom login and registration, multi-factor authentication, social logins, or user management? Download FusionAuth Community Edition for free. The best part is you get unlimited users and there's no credit card or subscription required. Learn more and get started at pythonbytes.fm/fusionauth. The link's in your show notes.

25:41

Thank you to FusionAuth for supporting the show. All right. What do you got for your next one, Brian? Numbers, something every computer scientist should know? Yes. Floating-point arithmetic is complicated. And so when I started working professionally, one of the things I was recommended to read was an article called What Every Computer Scientist Should Know About Floating-Point Arithmetic. And don't worry, it's only like a

26:06

really long paper with lots of math. so I am not telling you to read this, although it is an

26:13

interesting read. What I would like you to read is this article by David Amos called The Right Way to Compare Floats in Python, because there are a few things that we need to know about floats when we're using them. And he covers all of this in the article without going through tons of scary math. The floating point numbers have to be represented in a way that the computer can store them and use them and manipulate them, even though some numbers are huge and won't fit

26:43

normally. So we have to do things like accept that there's error and rounding. So there's a little bit of a discussion there that he talks about. One of the things that surprises people sometimes when they first come into Python, but it's not just Python, it's most languages, is that somewhere there's going to be something obvious that doesn't work. Like in David's example, `0.1 + 0.2 == 0.3`. And that will show up as false, because they don't match.
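
The surprise described here is easy to reproduce in any Python session. The sketch below only uses built-ins and the standard `decimal` module; the variable names are arbitrary. Converting a float to `Decimal` shows the exact binary value that is actually stored for `0.1`.

```python
from decimal import Decimal

total = 0.1 + 0.2

print(total == 0.3)  # False
print(total <= 0.3)  # False
print(repr(total))   # 0.30000000000000004

# Decimal(float) exposes the exact value stored for the literal 0.1,
# which is slightly larger than one tenth.
exact = Decimal(0.1)
print(exact)
```

Neither `0.1`, `0.2`, nor `0.3` has an exact binary representation, so the rounded sum and the rounded literal end up as two different floats.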

27:14

And this is weird. It's obviously crazy that that doesn't work, but it's not just equals. You can also do comparisons like, you know, less than or greater than. So not only are they not equal, 0.1 plus 0.2 is not even less than or equal to 0.3. It's weird. So what do you do? The gist of it is, don't compare things with normal math comparisons if there are floating points involved. So what you want to do instead is, and there's,

27:49

here's a little tiny bit of math, way less than the, than the example. The thesis, the dissertation. Yeah. So there's a whole bunch of stuff built into Python that you can use to work with comparisons. And one of the most common ones, I'm trying to get there, is `math.isclose`. So there's a math library with an `isclose` function that's used to just say, hey, I've got two values. Are these close, close enough? And,

28:18

if you have to compare floats, something like this is great. And behind the scenes, what it does is it's taking the two values and subtracting them and figuring out if the absolute value of the delta is below some tolerance, some reasonable tolerance, like close enough. And what that tolerance is,

28:41

is either a relative or absolute tolerance. And, you, most of the time you can kind of get away with not caring about that, but if you do care about it, you can control that you can pass in what tolerance you expect things to be closer to. I use stuff like this all the time with, with test equipment, because I definitely want to know, control over the tolerance levels. So, yeah, for sure. So there's math is close, but then there's also, I'm not going to

29:08

scroll all the way down here, but he also covers NumPy. So NumPy has got a couple of these that are really great. One of them is `isclose` also, but it works on arrays and it'll give you an array of true and false values. But you can also use `allclose`, which just says you've got two arrays, and it's true if all of the pairs are close enough. Also covered, which we use during testing a lot, is `pytest.approx`,

29:37

which is a little bit of a different beast, but David covers that. So basically this is a semi-regular reminder to anybody using floating point math in Python, or any other language, that you should be careful with it. So. Yeah. It's not a Python thing. It's just a thing about representing things that don't fit. Now there are sometimes cases where you have to be very exact. You need to be very precise. And in those cases, Python does have the `Decimal` and `Fraction` types.
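
A short standard-library sketch of the two approaches just discussed: tolerance-based comparison with `math.isclose`, and exact arithmetic with `Decimal` and `Fraction`. The specific tolerance values are illustrative choices, not recommendations from the article.

```python
import math
from decimal import Decimal
from fractions import Fraction

total = 0.1 + 0.2

# The default relative tolerance is 1e-09; abs_tol defaults to 0.0.
print(math.isclose(total, 0.3))                # True

# A relative tolerance cannot help when comparing against zero,
# so an absolute tolerance has to be passed in that case.
print(math.isclose(1e-12, 0.0))                # False
print(math.isclose(1e-12, 0.0, abs_tol=1e-9))  # True

# Exact alternatives: build Decimal values from strings, not from floats.
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))     # True
print(Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10))  # True
```

Choosing between `rel_tol` and `abs_tol` depends on the scale of the values being compared, which is why test-equipment style checks usually set the tolerance explicitly.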

30:05

And David covers these in the article, which are cool. They're cool things to know about, definitely for people working with money or other very high precision values. So those are covered. You do take some sort of a hit for those. But if you really care about the precision and want to do things exactly right, then you probably should read that larger article, because there are things that you have to do, like certain operations before

30:34

other operations to try to keep the error from accumulating too high. So it gets messy. Interesting. I think I'm fundamentally disturbed by the idea that zero isn't zero. So my approach to floating point numbers is normally to convert them to ints. Yeah, sometimes that is the way to do it. Right. I was thinking this kind of stuff maybe applies a lot to the project that

30:59

you're working on, if you're trying to come up with ratios that represent, you know, how risky something is and things like that. Yeah. I mean, certainly a lot of, yeah, I was being a bit flippant before. It's just fun. I'm very Platonic at heart, I think. So, like, zero and one should be zero and one, not nearly one or nearly zero. There should be a perfect square and a perfect circle. Like, how can they not exist in our language? Is it really zero or negative zero?
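The comparison tools discussed in this segment can be sketched with the standard library alone. This is a minimal illustration, not code from the article: the variable names and the instrument reading of 4.97 against a nominal 5.0 are made up for the example.

```python
import math
from decimal import Decimal
from fractions import Fraction

# Binary floats cannot represent 0.1, 0.2, or 0.3 exactly.
total = 0.1 + 0.2
exact_equal = (total == 0.3)              # False
close_default = math.isclose(total, 0.3)  # True (default rel_tol=1e-09)

# A relative tolerance is useless when comparing against zero,
# so pass an absolute tolerance in that case.
tiny = 1e-12
near_zero_default = math.isclose(tiny, 0.0)            # False
near_zero_abs = math.isclose(tiny, 0.0, abs_tol=1e-9)  # True

# Hypothetical test-equipment check: is the reading within 1% of nominal?
reading, nominal = 4.97, 5.0
within_one_percent = math.isclose(reading, nominal, rel_tol=0.01)  # True

# Decimal and Fraction keep these particular values exact.
decimal_equal = Decimal("0.1") + Decimal("0.2") == Decimal("0.3")      # True
fraction_equal = Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10)  # True

print(exact_equal, close_default, near_zero_default, near_zero_abs)
print(within_one_percent, decimal_equal, fraction_equal)
```

Note that Decimal is built from strings here; Decimal(0.1) would carry over the float's representation error.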

31:27

Henry in the audience also points out that pytest.approx works on NumPy arrays as well. Nice. Which is pretty cool. You can put that all together. All right. Let me tell you all about pypyr. I think that might be the way you pronounce it. Everything needs its own little phonetic bit. So this is a simple way to create scripts that run and do

31:57

stuff on your computer using Python. And what's cool about it is it has a real simple way to define the steps. Some of those steps can be optional, but then you can also piece together things like other programs. So you can combine commands, different scripts in different languages, and applications all into one sequence of events that happens on your computer. So it's basically a task runner where you define stuff in YAML. And probably the best way to see it is to go check out the docs. And

32:25

there's a whole bunch of docs. The docs are really nice here, actually. So for example, if you go to Getting Started and come down here to run your first pipeline, I really like the way the docs here look. But the way you define it, here's a one-step one: you just say the steps, and it's all YAML, and give a step a name so you can refer to it. And then you have inputs and outputs and you do the little curly-brace string interpolation types of things. Or you can

32:51

have more complex ones with different steps, and you can even have little comments. There's a way to put a comment in your YAML file as well. So there's also conditionals. Let's see if I can find a good conditional one down here. Here's one that works with, like, this one is just an echo statement and the ping command, but, you know, whatever you want to do. You can basically pass command line arguments to the YAML file, or to the workflow, the pipeline, and it'll take those and

33:21

feed them into the steps. So for example, when you call it, you can say like count equals one and IP equals that. And those will become the little string-interpolated pieces that go in there. So you can just combine basically whatever commands are available to the shell, right? Be that Python or POSIX or Windows or PowerShell or whatever you're looking to do. Pretty cool, huh? Hmm. That's pretty neat. I might need this for my job of automating my show notes.

33:49

I might use some of this. Oh yeah, there you go. If you can find this, go do that. And so on. Like, here's one that sort of uses truthiness. So there's a bunch of different steps, and you can use the run flag. So here it says run if there's a value for A on this one. And this one says run if there's a value for B. And then there's an example where it says, okay, we run it by itself, and those don't run. But if you pass A, then it runs the A step. If you pass B, it does the B step,

34:15

or it can do both if you pass them both. And I like the simplicity of it. A lot of tools like this feel like they're pretty complicated. You know, sort of like your example with gensim, Brian, where you're like, is this thing too heavyweight for what I'm trying to ask it to do? You know? And this seems like a real simple thing. And I don't have to learn about Make or any of those kinds of things.

34:34

Yeah. GitHub Actions or, yeah. It's got a bit of a GitHub Actions feel to it, but it seems like a nicer kind of declarative. That's really cool. Indeed. Yeah, if you were not into programming or you didn't want your steps to be programming. But of course at each step you could call a Python app or script that's going to do something complicated, right, if it needs to. The orchestration of that,

34:59

you don't have to make complicated. Is it just a command line tool? Or can you invoke it from Python? Might be a bit interesting. I'm sure there's a way to import it and make it do a thing. You know, it's probably just a Python package with an entry point in this package. So I would think so. Yeah. Because it would be nice to be able to do that rather than just using subprocess to invoke a lot

35:18

of things. Like if you're in. Oh, interesting. I hadn't really thought about it as a replacement for subprocess, but yeah, because a lot of times when you're trying to orchestrate stuff, like it talks about here, being part of the shell or being another app or another language, you would just use subprocess on it. Right. Yeah. Cool. Well, there it is. pypyr, pypyr.io, and people can check that out. It looks,

35:40

looks pretty interesting. Nice. All right. Ian, you want to take us out with your final item here? Ah, Pygments. Okay. So this is a package. I mean, if you're a developer, there's a very good chance that you have been using this for years, like me, without knowing about it. You might have seen it being installed as a dependency. It's like, what is that thing? That was my thought, Ian. I'm like, I know I see this all the time in my dependencies and I just never really bothered to

36:03

look into what it does. Yeah. So I hadn't until recently. So if you use Jupyter Notebook markdown, you know, you can put, like, three backticks and then a block of code. And you can actually put Python or bash or something as a language name, and it will intelligently highlight it. So the thing that's doing that intelligent highlighting is Pygments. GitHub markdown, same kind of thing. Although I'm not

36:28

sure whether GitHub uses Pygments. And if you do developer docs, like Read the Docs and Sphinx, that also uses Pygments to color code your code samples. And I know there's a lot of, you know, writing blog posts and stuff like that. There are quite a few services out there where you can take a chunk of code and it will intelligently highlight it and give you a JPEG or a PNG back. And that's kind of nice, but then you can't copy

36:56

and paste the code from those samples. So I don't like that really. I think if you're going to put code in an article, you probably intend for people to be able to copy and paste it. Yeah. That's the most likely thing you are to copy and paste. Yeah. Right. Because you want that code over here. Yeah. You don't want an image of your, I mean, because you could use OCR to reinterpret it, but yeah. And then maybe Brian's gensim to, like, tidy it up.

37:19

But so with Pygments, you can use it as a standalone package and it can do this kind of rendering, and it can render to, like, HTML with CSS style sheets for all of the coloring. It also renders to, like, ANSI terminal, LaTeX, a few other kinds of things. So, you know, if you want to get a nicely formatted piece of code in a document, or you're doing developer docs, it's certainly kind of useful. I mean, I came across it, or should I just say

37:50

one thing: it also supports, maybe I can just switch, it supports lots and lots of languages. So it's very simple to use. It has a highlight function, and then you import a lexer, which is the thing that understands the tokens in a language, and a formatter for the output type you want. And I think there are hundreds of these things. So there are a lot of languages in there. No kidding.
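The highlight function, lexer, and formatter described here fit together roughly as follows. This is a minimal sketch that assumes Pygments is installed; the sample snippet and variable names are illustrative and are not the example from the show notes.

```python
from pygments import highlight
from pygments.formatters import HtmlFormatter
from pygments.lexers import get_lexer_by_name

# Illustrative source text to highlight.
code = 'print("hello")\n'

# The lexer understands the tokens of one language.
lexer = get_lexer_by_name("python")

# The formatter chooses the output type. nobackground=True asks the HTML
# formatter not to set a background color on the wrapping element.
formatter = HtmlFormatter(nobackground=True)

# highlight() runs the lexer over the code and renders it with the formatter.
html = highlight(code, lexer, formatter)

# CSS rules matching the class names used in the generated HTML.
css = formatter.get_style_defs(".highlight")

print(html)
```

The resulting HTML keeps the code as selectable text, so readers can still copy and paste it, unlike an image of the code.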

38:13

More than half of these I've never heard of. And as well as things you'd expect, like Python, it supports Python tracebacks. So it has a separate lexer for color coding tracebacks. All the usual languages you'd expect, but also some things like data formats, like TOML, JSON, XML. Okay. Interesting. Like a lot of the files that we might run across. Yeah. And so it's very, very easy to use. And the reason I came across it is because I,

38:44

recently, well, a lot of attacker code tends to be deliberately obfuscated. So it's kind of Base64 encoded, but then even once you decode it, it's kind of munged in a way to make it as unreadable as possible. So one of the things that we try to do is pull that code back, like decode it, try to,

39:03

like, clean it, deobfuscate it. But if you can present it as close to the way a developer would write it as possible, it makes it much quicker for an analyst to determine, what is this doing? So we've used it now in MSTICPy to color display things like PowerShell script or bash or something like that. So that's how I came across it. Rather than just seeing it go past as part of a pip install, I actually had to invoke it

39:32

directly. So a big shout out to the developers and maintainers of Pygments. It's one of those packages that probably millions of people benefit from, but very few people know about it. And it's just super easy to use. They seem to be adding lexers all the time. So, great. Yeah, this is amazing. I didn't realize that it did all of this. This is way more advanced than I thought. Brian, did you know? No, I just thought it was

40:00

something that magically did syntax highlighting, so I didn't have to care about it. Yeah, exactly. I've got a little example in the show notes as well. I posted it; it has a dark theme. Yeah. And you probably want to include this nobackground=True if you're using Jupyter Notebooks. Because if you select a theme, it just flips the whole notebook's CSS theme. So that tells it just not to mess with what's in the background. Okay. Yeah,

40:29

that looks great. Yeah. Thanks for pointing out how useful that can be. That's cool. Like I said, I've seen it go by all the time. I just never really paid that much attention to it. It's probably a pretty minority use, but if you need it, it's great. Yeah. It's incredibly powerful. Fantastic. Well, that's all of our main items. Brian,

40:46

you got any extras? Just one extra, actually. When I was doing that first topic with gensim, it doesn't have very many dependencies, but one of the dependencies is this library called smart_open. And I'm like, what? I open things and I want to be smart about it. So I wanted to check this out and it's pretty neat. I don't know if we've covered this before, but it basically mimics the interface of open,

41:15

normal Python open, but you can pass it really anything. And it does, like, transparent on-the-fly reading of things, efficient streaming of large files from S3 or Azure, or over the web. Even straight HTTP. Yeah. If you just have a link to a large file on a web server. Yeah. And then the code for it is just super nice. You know, you import open from smart_open and you've got, like, for line in open of this thing, and you can just work from each

41:49

line there. It's pretty cool. I love it. That's a great one. Very nice. Ian, you got any extras you want to shout out while we're here? I don't, I'm afraid. I have two real quick ones to just quickly talk about. Last time, Emily Morehouse spoke about using autosquash, which was really cool. So Adam,

42:14

let me get the attribution correct here. Adam Parkin sent in a follow-up to say, hey, you should check out this article over here called Fixing commits with git commit --fixup and git rebase --autosquash. Woo. The long and the short of it is it talks about doing a lot of the things that Emily said, which was pretty cool, but in the end setting up your .gitconfig so autosquash equals true, and then adding an alias.

42:42

So you can just type git fixup. And when you type that, it actually does git log and shows the last 50 items and then allows you to go back and work with those. And basically it's just a real quick way to get back into the scenario where you mark different elements for fixup. So people can check that out if they were following Emily's advice, but they want it to be, like, one line they don't have to remember. There you go. That's cool. And then Python 3.10.3 is out as of about a week

43:11

ago, I suppose. So there are many changes in here. You know, there's like so many great changes here. I don't know, how many do you think that is? Probably a hundred, maybe a little bit less. It would be great if there was like a "these are critically important" at the front. Like there's a security problem that was fixed, or there's a thing we've taken out that's no longer here.

43:33

They're kind of all the same priority. But nonetheless, there's a bunch of changes that people can check out and upgrade to the newer version of Python 3.10. Different people care about different stuff though. I know. I don't want to impose my importance on other people's importance. Yeah. So it's funny, when I first came across Python, you were kind of like, why is it so slow between the major versions coming out? But then suddenly,

43:57

as a Python developer, it's like, why are the versions coming out so quickly? Yeah. It's definitely true. There's a ton of change. This is just, you know, some minor version change that has all these changes in here, which is pretty cool. Well, we also used to be on an 18 month cycle and now we're on a yearly cycle. So, yeah. It's Łukasz Langa's fault that we are 50% faster now. Thanks, Łukasz. All right. How about a joke to close out the show? That'd be great.

44:24

Yeah. So here's a good tweet and it's this sort of perplexed, I think in a good way, character wearing all these, are these prizes? I don't know. Anyway, Python developers, when someone asks what their secret is, and this person just says, I just keep writing pseudocode and it just keeps working. It's a little bit like that joke where they have some code, pseudocode in a text file. They're like, just rename it to .py and try to run and see what happens. Anyway, that's the joke. Nice.

44:56

Thank you, Brian, as always. And Ian, thanks for being part of the show. Thank you. Great to have you here. Thank you very much both. It's been a real pleasure. Yeah, it sure has. See y'all.

Transcript source: Provided by creator in RSS feed
