#20 Finding similar but not identical images in 128 bits via Python

Apr 05, 2017 | 24 min | Ep. 20

Episode description

See the full show notes for this episode on the website at pythonbytes.fm/20

Transcript

Hello and welcome to Python Bytes. This is episode 20 where we are delivering Python news and headlines directly to your earbuds. I'm Michael Kennedy. And I'm Brian Okken. And we've got a bunch of stuff lined up for you today. I'm really excited to share, especially this first article that you chose, Brian, which is so clever. Before we do, I want to say thank you. Thank you to Rollbar, who's back to sponsor a bunch more Python Bytes. And we'll talk more about Rollbar later, but thanks, Rollbar.

That's awesome. Yep. So we were just talking about pictures. Like I have many gigabytes of pictures. And if you ran a website that accepted uploads in large numbers of pictures, how do you deal with all that data? Especially there's probably a lot of duplicate data, right? I'm not sure. And so this is an interesting article. There's an article from jetsetter.com. And they're an invitation-only travel community. But the article is duplicate image detection with perceptual hashing in Python.

And that actually sounds more... Perceptual hashing. That's awesome. Perceptual hashing. It's awesome. And the idea is they've got... I mean, the site's got a bunch of pictures of different places around the world. And they don't want pictures that are mostly close to each other. I mean, for family photos, you got a ton that are close to each other. But I get for like... There's a lot of cases where you don't want things that are almost the same.

Right. Like pictures of hotels or pictures of a marina to say, here's the view out of the hotel. Like if they're going to have a listing on like some location of some hotel and they ask people to upload them, they don't need like 100 ones from this one view. And if you check out jetsetter.com, it is an intensely photo-heavy site. Like I'm pretty impressed with the number of photos on that page. With the idea of perceptual hashing, I was definitely interested in reading about this.

And I expected it to be a fairly complicated algorithm. But it's actually ingenious. And it's a... They use Python and get... Transfer the image down to just a 9x9 square. I don't get... Of gray values even. I don't get how that's enough information. But it is apparently enough to determine whether or not an image is close to another image. And they do a delta. I'm not going to be able to... Can you explain that much better? I can try.

I mean, when I read, we take a 5 megapixel image and we generate a 128-bit hash. And that means a thing. Like that means uniqueness. Or actually it means similarity, which is actually more important. I was like, okay, I have to figure this out. And I guess what they do is they take a large image and they like average it down to a 9x9. They say for larger images, like a 17x17 image.

And to determine the similarity, maybe somebody's off by 5 feet to one side or the other to take a picture of a hotel or a view or something. But if you kind of average it down to that 9x9, that's where the similarities kind of collapse into those grids. And then they run an algorithm on that grayscale grid, right? Yeah. And then the interesting thing is that, of course, it's clear to me that you could come up with a hash algorithm for an image.

But the difference in the hashes is enough to tell you how close the image is. Yeah. And it's actually the opposite that really blows me away is like two similar images that are not the same generate the same hash. That's what's the magic. Like that totally blows my mind. I could see like, well, obviously hash is different, images are different, but images are similar, not the same, hash the same. That blows me away.
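A rough sketch of the idea described here, not the article's actual code. The names `shrink`, `dhash_128`, and `hamming` are made up for illustration, and the image is assumed to already be grayscale, stored as a 2D list at least 9 pixels on each side. The sketch averages the image down to a 9x9 grid, records for each cell whether the neighbor to the right and the neighbor below is brighter (64 bits each, 128 in total), and compares two hashes by counting the bits that differ.

```python
def shrink(pixels, size=9):
    """Average a grayscale image (2D list) down to a size x size grid.

    Assumes the image is at least `size` pixels in each dimension,
    so every block contains at least one pixel.
    """
    height, width = len(pixels), len(pixels[0])
    grid = []
    for gy in range(size):
        y0, y1 = gy * height // size, (gy + 1) * height // size
        row = []
        for gx in range(size):
            x0, x1 = gx * width // size, (gx + 1) * width // size
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(block) / len(block))
        grid.append(row)
    return grid


def dhash_128(pixels):
    """Build a 128-bit difference hash from a grayscale image."""
    grid = shrink(pixels, 9)
    row_bits = 0  # is the right-hand neighbor brighter?
    col_bits = 0  # is the neighbor below brighter?
    for y in range(8):
        for x in range(8):
            row_bits = (row_bits << 1) | int(grid[y][x] < grid[y][x + 1])
            col_bits = (col_bits << 1) | int(grid[y][x] < grid[y + 1][x])
    return (row_bits << 64) | col_bits


def hamming(a, b):
    """Number of bits that differ between two hashes."""
    return bin(a ^ b).count("1")


# Synthetic 90x90 test images (hypothetical data, not real photos).
original = [[x + y for x in range(90)] for y in range(90)]
shifted = [[(x + 5) + y for x in range(90)] for y in range(90)]   # same scene, panned 5 px
inverted = [[255 - (x + y) for x in range(90)] for y in range(90)]

near_distance = hamming(dhash_128(original), dhash_128(shifted))
far_distance = hamming(dhash_128(original), dhash_128(inverted))
```

A small distance between two hashes suggests near-duplicate images, and a large one suggests different images. Where to draw the threshold is a tuning decision this sketch does not try to answer.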

Yeah. And I like it that it's not that complicated of an algorithm and it's a fun read. Yeah. That's, you know, so I think there's a couple levels of interesting that you brought up this article. And one of them I think is really interesting is when I first heard that, I thought, okay, one, this is going to be super hard, super computational. Two, maybe this is like machine learning or something like that.

Like two machines, like two images given to an AI, like a deep learning neural network or something. You say, yeah, these are sufficiently similar in ways that I don't really, people don't really understand. But magic on GPUs and lots of, you know, neurons, it works out somehow. But the fact that it's really, really a simple algorithm is what's, what's I think kind of special about it. Right. It's like, hey, there's still lots of places to be clever and not just throw AI plus GPUs at a thing.

Yes, definitely. Yeah. And not only that, you get to take it with you, right? It's available on GitHub. Yeah, they do have it. It's a, what is it? P-Y-B-K-Tree? pybktree, whatever that means. Okay, awesome. I'm sure it's part of the algorithm. Excellent. So keeping with open source projects that you can go find and just grab and do cool things with, one of the listeners pointed us towards Google Open Source.

In fact, it was the guy from Google Fire, Python Fire, which we'll talk more about later. But he has one of the projects there. And on Google open source, they've basically created like a listing directory of all of the open source projects. Now, many of the projects still live on GitHub. But this is like a place where you can go search and analyze and discover projects from Google. And what's cool is you can sort by language. So show me the Python projects. Show me the C++ projects, whatever.

So I grabbed six or seven interesting projects. I just wanted to run them down for you, Brian. Okay. Yeah. So one of them is subprocess32, a reliable subprocess module for Python 2. Apparently, the built-in subprocess is not reliable for Python 2. I don't know. But I didn't know that either. That's partly why it's interesting to me. But also, you know, there it is. That's cool. Grumpy. We've talked about Grumpy before. Grumpy is Python on Go instead of Python on CPython.

Yeah. Yeah. That's a good one. That's a... Python Fire, of course. Python Fire, of course. Like I pointed out, that's a way to take any Python object or module and turn it into a command line interface. There's a Python client for Google Maps services. So if you want to consume Google Maps from Python, do it. There's hyou, H-Y-O-U, a Python interface for manipulating Google spreadsheets. That's cool, right? Okay. I'm going to have to try that out. That's neat.

Yeah. I mean, I've seen the stuff for working with .docx and .xlsx files, the Microsoft Office ones, but I didn't know about the Google spreadsheet one. So this is cool. Another thing that's always tricky for me is working with OAuth, right? There's always this, like, I've got some app. The app needs to go, like, open a browser window, and there's some sort of funky callback, and things happen. And so one of the places that's especially challenging, I think, is over a command line interface.

Well, there's oauth2l. I think it's L. And what that is, is it's a command line tool to get an OAuth token. Just let that sink in for you. Okay. So I want to log in as Google. I can do that, like, through my app. Like, I could basically create a shell script that, through the CLI, gets an OAuth token from the user. That's pretty interesting. Okay. And also, I talked about the Google Maps API.

Like, that sounds like that's something that's really hard to, like, unit test or test at all without actually going to Google. So there's a mock maps API. So a small little app engine app for testing, like, basically mocking out Google Maps API. And last but not least, TensorFlow. The amazing deep learning, machine learning stuff. That's about 50% Python, 50% C++, and a lot of GPUs in action there.

And I don't know where I read this, but I think that this Google open source location is not just all projects. It's projects that they consider still active. Okay. Yeah, that's cool. I mean, obviously, you don't want just, like, a dumping ground, right? Yeah, cool. I mean, everything in there looked pretty neat and fresh, so it's good. It's a fairly neat interface, too, with, I guess, panels and stuff. Yeah, it's worth checking out. Okay. What do we got next? Oh, next is me.

Yeah, more machine learning type stuff. Yeah. So there's an article from Jason Brownlee called, and I just clicked away, How to Handle Missing Data with Python. And this is something that I definitely deal with at work, with measurement values. The gist of it is, a lot of times you're dealing with large or small data sets, and some of the values are missing.

And there's a whole bunch of different ways you can deal with missing data. A few of the ones he talks about involve replacing values. You know, you have to know what the magic number is: some data collection will fill in a zero, maybe, if there's no data, or some other known number. But all your math is going to get messed up if you actually just leave that there. So there's a couple ways to get around it. One of the ways he lists is using NaN, not-a-number, values.

And I think pandas can deal with that correctly and not average those in. Yeah, what I think is really nice about it is like, I could be given a CSV file or some sort of set of data, and I could like work my way through it and maybe find the bad data and fill it in potentially. But his fixes are like, you run this one line in pandas and magic happens and it's better, right? It's like the fix is so much better than the fixes that I would come up with.

Yeah. And I do like that he's talking about different ways to deal with it with numpy, even without pandas also, because you might not be using pandas. But the, like one of the ways you would do it with any math package really would be to, oh, I guess I don't know how to do that. Actually. Nevermind.

Filling in the, you'd somehow have to find all of the values anyway and fill them in with, like one of the ways is if you're calculating an average, calculate the average of everything else and then fill in the blanks with the average. Right. I guess it depends on what you're going to do. Are you going to average it? Are you going to max it or min it? You could like push that through, right? Yeah. Yeah. Interesting.
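The fill-in-with-the-average approach just described can be sketched with only the standard library. This is an illustration under stated assumptions, not code from the article: the readings and the choice of 0.0 as the "no data" marker are invented for the example.

```python
import math
from statistics import mean

# Hypothetical sensor readings where 0.0 was written when no data was collected.
MISSING_MARKER = 0.0
readings = [4.0, 0.0, 5.0, 0.0, 6.0]

# Leaving the markers in place drags the average down.
naive_average = mean(readings)

# Step 1: replace the magic number with NaN so it is clearly "not a value".
marked = [math.nan if r == MISSING_MARKER else r for r in readings]

# Step 2: average only the values that are actually present.
present = [r for r in marked if not math.isnan(r)]
fill_value = mean(present)

# Step 3: fill the blanks with that average.
filled = [fill_value if math.isnan(r) else r for r in marked]
```

With pandas, the same steps would typically be a `replace` of the marker with NaN followed by `fillna` with the column mean. Which fill strategy is appropriate depends on what you compute afterwards, as discussed above.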

The best solution definitely, I think, is using the not-a-number value and letting the libraries take care of it for you. But I wanted to bring this up partly because anybody that's working with data collection and doing math with that has to deal with the fact that sometimes there's not numbers there and you have to deal with it. So. Okay. Awesome. He's from machinelearningmastery.com, I think. And he's got just a ton of cool stuff going on over there. Right.

It's not just this one article. Right. So if you're into these kinds of things, definitely check it out. Yeah. It looks good. Okay. So what's up next is the Hug REST framework. But before we get to them, I want to give Rollbar a hug. Rollbar is awesome. I've been, as people know, I've been using them for a long time on the websites and the websites are getting more and more traffic.

And I recently, I'm not sure whether it was a wise decision or not, because I'm really busy with other stuff, but I just got really frustrated with the way my servers are working, the way I could sort of move them around and performance and stuff. So I said, that's it. One day I just woke up, so that's it. Converting it all to MongoDB. And so that was last week. And that took like three days of rewriting all my sites to Mongo, which I really think Mongo is the right choice.

And I'm just loving the way it's working now. But that was a pretty serious, like take the guts out of all my web apps and stick in a new set of guts that are similar, but not entirely compatible. I spent a little time with Rollbar and they helped me find a few problems, like where maybe types used to be strings. I'd compare them where one was no longer a string and they didn't compare the same. So I got weird errors, but Rollbar made it super easy to track that down.

So if you want to have reliability and most importantly, awareness of the state of your apps, plug in Rollbar to your web apps. You can use it in Pyramid, Flask, Django, whatever, just plug it in and you'll get notifications right away. So be sure to visit rollbar.com/pythonbytes, and you'll get a special offer to get started there. And I bet that you definitely noticed those messages, but I didn't even notice you were mucking with things.

And I'm pretty sure that nobody else did or very few people did either. Yeah, that's true. And thank you for saying that. But I actually know how many people ran into problems, right? There was a couple, but I got an email from a couple of people saying, Hey, I had this problem with your app. I'm like, I know, but I didn't know your email address. But I know what your problem was, and it's already fixed. I just couldn't contact them. So because they hadn't actually created an account yet.

So it was really nice to be able to just say, yeah, actually, the problem you're telling me is already fixed. I just couldn't communicate that back to you. Really sorry about that. It's awesome. You seem like a big team then because of that. So oh, yeah, definitely. It's all the folks here in the cubicle farm. We're busy. You know, one of the next things that I want to do is build some nice APIs. And I think it's really an interesting time for the web in Python.

There's a lot of flowers blooming, if you will. Right? We've got Pyramid, Django, Flask. Those guys are all doing super stuff. And like most of my stuff is Pyramid. But we've got Japronto coming along, Sanic. And another one that I just learned about is called Hug at hug.rest. How's that for a name and a domain? Yeah, actually, it is. It's www.hug.rest. Hug.rest. That's beautiful.

So Hug is a Python framework, web framework, just specifically for building restful, documented, documentable, versionable APIs. And it's built both for like super simplicity and flexibility as well as performance. So I started looking this up. Wow, this is quite interesting. Okay. So the idea is you can create an API once and you can consume it in all these different ways. So you can import it as a module or a package into your project and use the API that way.

You can communicate it, obviously, over HTTP as like a RESTful API. Or it also has a CLI, command line interface, way to expose that. So if you write like some kind of a web app or functionality you want to expose over an API, but you also want to call it locally, it's like the same code. Oh, wow. That's interesting. It's also written in Python 3. It uses Cython all over the place. So it's like super fast. It's one of the fastest web frameworks out there for these kinds of things.

At least the non-async version, let's say. If you compare those, it's pretty cool. It's got a decorator model, so the code looks really clean. Yeah, and the decorator model is cool because the decorator model will do like version management. You can have like version 1 and version 2 of the API that have like different data formats. And they can just coexist. You get automatic documentation based on that.

Like it'll do type annotations and then like use the type annotations as part of the documentation and things like that. Oh, that's great. It's a pretty cool, simple little framework. So, you know, hug for those guys. Nice job. Definitely. Speaking of CLIs. Yeah, speaking of CLIs, I'm actually working on, I had an example I wanted to do that I'm running with the pytest book that I'm working on.

And for the front end of it, I was punting before and not actually putting a front end on the application. But I wanted to at least put a command line interface in. And my first attempt was to go down the argparse route. And with the particular quirks of this application, I needed subcommands. Actually, the tutorials I found were just out of date. They didn't work. And I was having a little bit of difficulty. So I went ahead and tried Click. I'd heard of Click before and hadn't tried it.
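For reference, the subcommand setup mentioned for the first attempt can be sketched with the standard library's argparse. This is only an illustration, not the code from the book project: the program name `tasks` and the `add` and `list` subcommands are hypothetical, and `required=True` on `add_subparsers` assumes Python 3.7 or newer.

```python
import argparse


def build_parser():
    """Create a parser with two hypothetical subcommands: add and list."""
    parser = argparse.ArgumentParser(prog="tasks")
    subparsers = parser.add_subparsers(dest="command", required=True)

    add_parser = subparsers.add_parser("add", help="add a task")
    add_parser.add_argument("summary", help="short description of the task")
    add_parser.add_argument("--owner", default=None, help="who owns the task")

    subparsers.add_parser("list", help="list all tasks")
    return parser


def main(argv=None):
    """Parse argv (defaults to sys.argv[1:]) and dispatch on the subcommand."""
    args = build_parser().parse_args(argv)
    if args.command == "add":
        return f"added {args.summary!r} for {args.owner}"
    return "listing tasks"


added_message = main(["add", "write tests", "--owner", "brian"])
list_message = main(["list"])
```

Click expresses the same structure with decorators on plain functions, which is part of why the resulting code tends to be shorter.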

And, man, a tutorial from like three years ago was about what I needed. And it works right away. I've got like half a page of code and my interface, my command line interface is done. That's really cool. That's also decorator heavy, right? Yeah, in my Sublime editor, it's colored nicely. And my wife walked by and said, that's such beautiful code. Oh, lovely. Let's take that on many, many levels, right? That's awesome. Yeah, that's by Armin Ronacher, the guy from Flask. So definitely.

Oh, did he do Click? I think so, yeah. I believe so. Yeah, nice. Click is cool. I've done a little bit of work with it. And I've liked what I've seen. But I also kind of want to, yeah, we'll talk about it later. But I might want to try adding a different CLI interface to it as well. Yeah, cool. So the last one that I chose for us is kind of a refresher, back to the fundamentals type thing. So Python's Instance, Class, and Static Methods Demystified.

So this one is on realpython.com. And I went over there and checked it out. And I said, oh, OK, realpython.com. That's cool. And then I realized this is actually from Dan Bader. And we seem to be covering a lot of Dan's stuff over here. And I actually have more to say about Dan later still. So this was a guest post Dan did for that, although I didn't realize that until I started getting into it.

And the idea was to demystify what's behind class methods, static methods, and regular instance methods. If you learn Python classes, if you learn classes and inheritance and object-oriented programming only through Python, this will be obvious to you. But if you come from other languages like C++ or Java or C# or JavaScript, there's differences to the way Python classes and inheritance works. And it's worth kind of a compare and contrast. So he comes up with a class.

And it's got like a regular method, a class method, so an @classmethod decorator, and it takes a cls parameter. And a static method with an @staticmethod decorator, which takes nothing. And he basically compares and contrasts how they work. And so some of the things that I think are not obvious when you're first getting started: instance methods, those are pretty straightforward. Like you call them on instances like all other languages.

But the fact that I can call static methods or class methods on instances, that's a little bit funky. Right? Yeah. That seems a little weird. And then the other one, the main one I think is like what's the difference? Why are there two things like static method and class method? They seem the same. Why are there two? And then like when would I use one versus the other? Right? The class method takes a cls parameter, which is literally the type that it's on. And the static method just doesn't.

But other than that, they seem the same. Right? And so if you're going to say like interact with the class, like during the class method, if you're going to create an instance of the class, you can use the cls parameter to support like inheritance and stuff. So if I got like a, let's say a Vehicle class and a car, like a Tesla car class, that class method could say like allocate a cls, whatever that is.

And if you called that class method on Tesla, it would actually create a Tesla. It would change like the thing, the type that it knows it is, where the static method is just like a grouping. So I thought that was interesting. Does the class method follow then the hierarchy then? So if I declare a class method on a base class, is it available to the subclass? Yes, always. And that's also true for static methods.

But the difference is the static method doesn't really know what type it's being called on. Oh, okay. Whereas the class method, it's given the type. So if you call it farther down in the inheritance chain, whatever level you're at, that instance or that type actually is communicated to it. And so you're told where you are in the hierarchy in a class method, where in static, it's just like, it's just a method. Go for it.
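A minimal sketch of the Vehicle and Tesla example from this discussion. The class names come from the conversation; the method names `describe`, `build`, and `wheels_for` are invented for illustration and are not from the article.

```python
class Vehicle:
    def describe(self):
        # Instance method: receives the instance as `self`.
        return f"a {type(self).__name__}"

    @classmethod
    def build(cls):
        # Class method: receives the class it was called on as `cls`,
        # so subclasses get instances of themselves.
        return cls()

    @staticmethod
    def wheels_for(axles):
        # Static method: receives neither the instance nor the class.
        # It is just a function grouped under the class.
        return axles * 2


class Tesla(Vehicle):
    pass


base = Vehicle.build()
car = Tesla.build()               # inherited class method, but cls is Tesla
also_car = car.build()            # class methods can be called on instances too
wheels = Tesla.wheels_for(2)      # inherited static method, no type information
```

Here `car` is a `Tesla` even though `build` is defined only on `Vehicle`, which is the hierarchy-aware behavior described above. The static method behaves the same no matter which class or instance it is called through.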

Okay. Yeah. I don't think I've ever used static methods for anything. Yeah. Well, they're out there hanging out with their friend class methods. Interesting. Indeed. So I have a quick follow-up from the last show. David Bieber from Google, the guy who works on Python Fire, sent us a note. And you said something to the effect of, look, Python Fire is awesome, but IPython is a serious dependency to take if I just want a CLI, right? And I think that's fair. That's fair.

But he said, hey, you know what? One of our primary plans is to remove IPython as a dependency. We're just not there yet. So if anybody in the audience wants to help those guys move forward, they're totally working on that. And so Python Fire from Google is definitely getting some interesting thinning out, and it'll be very nice. And actually, I like to hear that, that they're working on eventually getting rid of that dependency. And it's pretty cool.

Also, it's something I had mentioned when we talked about Python Fire, that your development time is important, too. And putting an interface together with that is pretty fast. So keep that in mind. Yeah. It's not always about optimizing for the machines. Definitely. Hey, one more follow-up is we did cover pdir2, or pdir, a couple episodes ago, which prints out dir output with colors. One of the complaints I had was that it didn't look that great on my black terminal. I had the same problem.

I like darker stuff. And I'm like, wait, where's all the words? They just updated it. And I guess yesterday, I think. And it does have color configuration now. So you can drop a pdir2 config file in your home directory. And I set my background color to magenta so that it was visible for docs, visible on both black and white. And now it looks great. Oh, nice. pdir2 now has themes. Love it. All right. How's the book coming? I heard there's a spotting.

Yeah. So on Twitter the other day, somebody, a guy named Jacob Jarros, I think that's right, noticed that it was listed on the Pragmatic Publishers website. So it's out there. That's awesome. I love the cover. The rocket is cool. Yeah. A 50s sci-fi nerd. So yeah. And just the perfect, it's perfect. Like it's 50s, 60s vintage rocket. So how about you? Well, it has been a super busy couple of weeks. I've been working on a couple of classes. One of them I'm about to release.

By the time this recording comes out, it will be out. So tomorrow, basically. A course called Using and Mastering Cookiecutter. So really deep dive into what is Cookiecutter? How do you create and manage projects with Cookiecutter? I think it's going to be a really fun course. And I also just a few hours ago launched Managing Python Dependencies with pip and Virtual Environments, which Dan Bader, speaking of Dan Bader, came over to join me to write a class for us over here.

And we're shipping that as well. So I took that course and I actually learned quite a bit from it. It's not just like pip install done. It's what is the process that you use to manage your dependencies? How do you like, what is the thinking and workflow you use to evaluate whether a package is worth taking a dependency on? And all sorts of cool stuff like that. Bunch of best practices. Launched both of those.

And I just started selling course bundles on Talk Python Training as well to sort of go along with those. So lots of stuff. That's pretty exciting. I got to check out the Cookiecutter thing. Yeah, thanks, Ed. It'll be out tomorrow morning. For everyone listening, that's today. That's today. But for you, Brian, that's tomorrow morning. The magic of time travel. Thanks so much for finding all these great items. That was fun as always, Brian. It was fun for me too.

And thanks to everybody for all your feedback that you send in. Yep. Thanks, everyone. And thank you, Rollbar, for supporting the show. Thank you for listening to Python Bytes. Follow the show on Twitter via at Python Bytes. That's Python Bytes as in B-Y-T-E-S. And get the full show notes at pythonbytes.fm. If you have a news item you want featured, just visit pythonbytes.fm and send it our way. We're always on the lookout for sharing something cool.

On behalf of myself and Brian Okken, this is Michael Kennedy. Thank you for listening and sharing this podcast with your friends and colleagues.

Transcript source: Provided by creator in RSS feed