Data science has been one of the major driving forces behind the explosion of Python in recent years. It's now used for AI research, it controls some of the most powerful telescopes in the world, it tracks crop growth and prediction, and so much more. But with all this growth, there's been an explosion of data science and machine learning libraries. That's why I invited Pete Garson onto the show. He's going to share his top 10 machine learning libraries for Python.
After this episode, you should be able to pick the right one for the job. This is Talk Python To Me, recorded July 20th, 2017. Welcome to Talk Python To Me, a weekly podcast on Python, the language, the libraries, the ecosystem, and the personalities. This is your host, Michael Kennedy. Follow me on Twitter where I'm @mkennedy. Keep up with the show and listen to past episodes at talkpython.fm and follow the show on Twitter
via @talkpython. Talk Python To Me is partially supported by our training courses. Here's an unexpected question for you. Are you a C# or .NET developer getting into Python? Do you work at a company that used to be a Microsoft shop, but is now finding their way over to the Python space? We built a Python course tailor-made for you and your team. It's called Python for the .NET developer. This 10-hour course takes all the features of C# and .NET that you think you
couldn't live without. Entity Framework, Lambda Expressions, ASP.NET, and so on. And it teaches you the Python equivalent for each and every one of those. This is definitely the fastest and clearest path from C# to Python. Learn more at talkpython.fm/dotnet. That's talkpython.fm slash D-O-T-N-E-T. Pete, welcome to Talk Python. Thanks. I'm happy to be here. That's great to have you here. And I've done a few shows on machine learning and data science,
but I'm really happy to do this one because I think it's really accessible to everyone. We're going to bring all these different libraries together and kind of just make people aware of all the cool things that are out there for data science, machine learning. Yeah, it's really crazy actually how many libraries are out there and how active the development is on
all of them. There's new contributions, new developments all the time. And it seems like there's new projects popping up like almost daily. Yeah, it's definitely tough to keep up with, but hopefully this adds a little bit of help for the reference there. But before we get into all these libraries, let's start with your story. How did you get into programming in Python? I started programming at a pretty young age, like sort of back before Stack Overflow and
things like that existed. And I mostly made games. I started with BASIC, like most people probably from a certain age, and then moved on to Pascal and was making games for my BBS back in the day, making online games, utilities and stuff like that. And then for Python, that came later. I worked in games for a long time. And when I worked in games, we were doing like tool automation, like build
automation, certain workflow automation, build pipelines, all that kind of stuff. So Python was something, a tool that we used quite a lot there. So that was where I got my start with Python. Oh, yeah, that's really cool. Python is huge in the workflow for games and movies, way more than people on the outside realize, I think. Yeah, especially for artists. So like a lot of the tools have Python built into them.
And so artists will use it for like automating model exports or rigging and that kind of stuff. So it's pretty popular in that sense. And then also still even just for like building assets for games. Okay, I'm intrigued by your BBS stuff. It occurs to me and it's kind of crazy. There may be younger people listening that don't actually know what a BBS is. Okay, so a BBS is short for bulletin board system.
And it was sort of like in a way the precursor to the internet where you used to host what is effectively sort of a website on your home computer. And people would like call your phone number. And you'd have it hooked up to your modem. And they would like call your phone number, connect to your home computer. So in my case, it was like my computer that I played games on and did my homework on and that kind of stuff. And they could connect and send messages to each other
and download files and play games, very simple games, that kind of thing. So it was like a... Yeah, it was so fun. Yeah, it was awesome. And like, I really, really enjoyed it. And you had a thing called like echo mail back then, which was like this sort of way of like transferring messages all over the world. So, you know, somebody would send a message on your BBS, and then it would like call a whole bunch of
others like in this network. And then somebody in like Australia might answer it. And it would take like days to get back because it would be like this chain of people's BBSes calling the next one. So yeah, there was no internet. It was the craziest thing. Like, at our house, my brother and I had talked my dad into getting us multiple phone lines so we could work with BBSes like in parallel. And you would send these mails and like at night, there would be like a coordination of the
emails across the globe as these things would like sync up the emails they got queued up. It was the weirdest thing. But I loved it. I don't know, what's it? Trade Wars or Planet Wars? One of those games. I really loved it. For sure. I'm like a huge Trade Wars fan. You can actually play it now. Like there are people who have
it set up on like websites that have like simulated Telnet stuff. And you can, you can play versions of Trade Wars, which I have done recently just to like, don't tell me that you're going to ruin my productivity for like the whole day. Yeah. You'll be after this. You'll be like, Oh, Trade Wars 2002. Is it still, it's still around? People still play it, but it was such a good game. It's fantastic. Yeah, it was
fantastic. It's awesome. All right. So that's how you got into this whole thing. Like what do you do today? You work at ActiveState, right? I do. Yeah. I'm a dev evangelist at ActiveState. So generally that means working with developers, language communities, trying to make our distributions better. So at ActiveState, we do language distributions. Probably a lot of people in the Python space know us that we do
ActivePython and it's been around for a long time. We were a founding member of the Python Software Foundation. And so ActiveState has a pretty long history in the Python community. And before that, people probably know us from Perl, and now we have a Go distribution and a Ruby beta coming out soon. So we're sort of expanding to all these different dynamic language ecosystems.
Sure. That's awesome. So I know that maybe people are a little familiar with some of the advantages of these higher order distributions for Python, but maybe give us a sense of like, what is the value of these distributions over like going and grabbing Go or grabbing Python and just running with that? I think that you've got obviously this sense of curated packages. So there are, you know, in the
Python distribution, there's like over 300 packages. And so you know that they're going to build, know they're going to play nice with each other, know that they have current stable versions, all that kind of stuff. And then additionally, you can buy commercial support. So for a lot of our customers, so we have a lot of like large enterprise customers, they can't actually adopt a language distribution or a tool like that
without commercial support. They need to know that somebody has their back. And so that's something that we offer on these language distributions for those large customers. But for the community and for individual developers, the value is having that curated set of packages that you know
is going to work, that you know is going to play nice. And also, as maybe a development team lead, you might want a unified install base, so that all your developers have the same development environment and you know it's all going to play nice. And so that's one of the advantages of those. That's really cool. Certainly the ability to install things that have weird compilation stuff. Do you guys
ship the binaries like pre-built for that? So I don't have to have like a Fortran compiler or something weird? Exactly. Yes. So they're all pre-built, all pre-compiled. So I mean, depending on what platform you're on, like on Windows, you might not even have a C compiler installed and a lot of packages are C-based. And so they're pre-built and, like you said, you don't need a
Fortran compiler or some exotic build tool to actually make it work. It just works out of the box. Yeah. Okay, that's really awesome. And ActivePython is free if I'm like a random person, not a huge corporation that wants support? Exactly. Yeah. If you're just, you know, a developer, it's free to download and free to use. And even if you are, you know, a large corporation, it's free to use in non-production settings. So on your
own. So, you know, you can go and just download it, try it out, see if it works for you. Okay, yeah, that sounds awesome. How many of the 10 libraries we're going to talk about would come built in? Do you know off the top of your head? I think that actually almost all of them, but I think Caffe is on the list. It's not in the current one, but it is on the list to be included. So I think actually pretty much all of the other
ones, though maybe CNTK as well, since that's really new. But you know, we are targeting to have as many of these as we possibly can. And so most of them are included. That's awesome. So for all the libraries that we're talking about, one really nice way to just get up to speed with them would be to grab ActivePython, and you'd be ready to roll. Exactly. Yeah, awesome. Grab them, install them, you're ready to roll right out of the gate.
Cool. All right. So let's start at what I would consider the foundation of them. The first library that you picked, which is NumPy and SciPy. Absolutely. And they are foundational in the sense that a lot of other libraries either depend on them or are in fact built like on top of them. Right. So they're, they are sort
of the base of a lot of these other libraries. And most people might have worked with NumPy. Its main feature is the n-dimensional array structure that it includes. And a lot of the other libraries either support being sent a NumPy array, or require that you format your data that way. So especially when you're doing machine learning, you're doing a lot of matrices and a lot of like higher dimensional
data, depending on how many features you have. It's a really, really useful data structure to have in place. Yeah. So NumPy is this sort of multi-dimensional array-like thing that stores its data and does a lot of its processing down at the C level, but has, of course, its programming API in Python, right? Yes. Yeah, exactly. And a lot of these machine learning libraries do tend to have C level,
like lowest level implementations with a Python API. And that's predominantly for speed. So when you're doing tons and tons and tons of calculations, and you need them to be really, really lightning fast, that's the primary reason that they do these things, you know, sort of at the C level. All right, absolutely. And so related to this is SciPy. They're kind of grouped under the same organization, but they're not the same library exactly, are they?
No. So SciPy is like a more scientific mathematical computing thing. And it has the more advanced stuff like linear algebra and Fourier transforms, image processing, and it has physics calculation stuff built in. So most scientific numerical computing functionality is built into SciPy. I know that NumPy does have linear algebra and stuff in it. But I think that the preferred approach is that you use SciPy for all that kind of linear algebra crunching.
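Editor's sketch: the division of labor described above, NumPy for the n-dimensional array and SciPy for the linear algebra, can be illustrated with a few lines of Python. The array contents and variable names are made up for illustration and do not come from the episode.

```python
import numpy as np
from scipy import linalg

# A 2-D NumPy array: 3 samples (rows) by 2 features (columns).
# The values are arbitrary illustration data.
features = np.array([[1.0, 2.0],
                     [3.0, 4.0],
                     [5.0, 6.0]])

# Element-wise arithmetic runs in compiled code, with no Python-level loop.
scaled = features * 10.0

# SciPy's linear algebra routines accept NumPy arrays directly.
# Solve the linear system A @ x = b for x.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])
x = linalg.solve(A, b)

print(features.shape, x)
```

The same `features` array could be handed unchanged to most of the libraries discussed later, which is the sense in which NumPy is foundational.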
Okay, yeah. So a lot of these things that we're going to talk about will somewhere in them have as a dependency or an internal implementation of some variation, or even in maybe in its API, like the ability to pass between them, these NumPy arrays and things like that. Absolutely. Yeah. One other thing that's worth noting, that's pretty interesting. And I think this is a trend
that's growing. Maybe you guys have more visibility into it than I do. But NumPy, on June 13th, 2017, so about a month ago at the time of the recording, received a $645,000 grant for the next two years to grow it and evolve it and keep it going strong. That's pretty cool. It is very cool. And I think that you're starting to see that these open source projects are really forming the backbone of most of the machine learning research and actual implementation that you're
seeing out there in the world. There's not a lot of sort of more closed source behind trade secret stuff. A lot of the most bleeding edge development and active development is happening in these open source projects. So I think it's great to see them receiving funding and sponsorship like that. Yeah, I totally agree. And it's just going to mean more good things for the community and all these projects. It's really great to see. One thing I want to touch on for every one of these is to give
you a sense of how popular they are. And for each one, we'll say the number of GitHub stars and forks. And that's not necessarily the exact right measure of popularity because, obviously, NumPy is used across many of these other things which have more stars, but people don't necessarily contribute directly to NumPy, and so on. But NumPy has about 5,000 stars and 2,000
forks to give you a sense of how popular it is. The next one up, scikit-learn, has 20,000 stars and 10,000 forks. So tell us about scikit-learn. scikit-learn is, again, like we mentioned before, a thing that's built on top of SciPy and NumPy and is a very popular library for machine learning in Python. And I think it was one of the first, if not the first, I'm not 100% sure, but it's been around for
quite a long time. And it supports a lot of the sort of most common algorithms for machine learning. So that's like classification, regression tools, all that kind of stuff. I actually just saw like a blog post come up in my feed today where Airbnb was using scikit-learn to do some kind of like property value estimation or something using machine learning. So it's being used very, very widely in a lot of different scenarios.
Oh yeah, that sounds really cool. It definitely is one of the early ones. And it's kind of simpler in the sense that it doesn't deal with all the GPUs and parallelization and all that kind of stuff. It just deals with classification, regression, clustering, dimensionality reduction, and modeling, things like that, right?
Yes, that's right. It doesn't have GPU support. And that can make it a little bit easier to install if you, you know, sometimes the GPU stuff can have a lot more dependencies that you need to install to make it work. Although that's getting better in the other libraries. And it's like you say, it is made and sort of designed to be pretty accessible and pretty easy, you know, because it has the sort of baked in algorithms that you can just say, oh, I want to do this and it will
crunch out your results for you. So I think that the ease of use and the cleanliness of its API have contributed to its longevity as one of the most popular machine learning libraries. Yeah, absolutely. And obviously scikit-learn is part of the whole SciPy family. It's built on NumPy, SciPy, and Matplotlib.
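Editor's sketch: the "baked in algorithms" point can be made concrete with a small classification example. This is a minimal illustration, assuming the Iris dataset bundled with scikit-learn and a logistic regression classifier; neither choice comes from the episode, and any other built-in estimator would follow the same fit, predict, score pattern.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a small built-in dataset; the data arrives as NumPy arrays.
iris = load_iris()

# Hold out a quarter of the samples to check the model on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0)

# Pick an algorithm, fit it, and score it.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

predictions = clf.predict(X_test)
accuracy = clf.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

Swapping in a different algorithm usually means changing only the line that constructs `clf`, which is a large part of why the API is considered approachable.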
Yes. Yes. So yeah, it includes interfaces for all that stuff and for like graphing the output and using matplotlib and yeah, using numpy for inputting your data and for getting your data results, all that kind of stuff. Yeah. Very cool. All right. Next up is Keras at 17.7 thousand stars and 6,000 forks. So this one is for deep learning specifically, right? Yeah. And so this is for doing rapid development of neural networks in Python. It's one of the
newest ones, but it's really, really popular. I've had some experience working with it directly myself and I was sort of really, really blown away by how simple and straightforward it is. So there's like, it creates a layer on top of lower level libs like TensorFlow and Theano and lets you just sort of define, I want my network to look like this. So I want it to have this many layers and this many nodes per layer. And here are the activation functions. And, you know, here's the optimization
method that I want to use. And you sort of just define this effectively a configuration, and then it will build all of the graph for you, depending on what backend you used. And so it's very, very easy to experiment with the like shape of your network and with the different activation functions. So it lets you kind of really quickly reach and test, you know, different models to see which one works better and to sort of see what one works at all. So it's really easy to use
and really very effective. I used it to build a little game demo where I trained an AI to play against you and determine when it could shoot at you. Was this the demo you had at PyCon? It is. Yeah. Yeah. And so we had that demo at PyCon. I since did a blog post about it. And then I actually just recently rewrote it in Go for GopherCon too. So eventually it will be open sourced
so that people can see. But one of the things that you really notice is that the actual like code for Keras to basically define the network and do the sort of machine learning heavy lifting part is very, very minimal, like a dozen lines of code or something like that. It's really surprising because you think it's like a ton of work, but it makes it super easy. Yeah, that's really cool. And it sounds like
its goal is to be very easy to get started with. I like the idea of the ability to switch out the backend from say TensorFlow to CNTK to Theano. How easy is it to do that? Like, could I run some machine learning algorithms and say, let's try it in TensorFlow and do some performance benchmarks and stuff, and then, no, no, let's switch it over to Theano and try it here and kind of experiment rather than completely rewriting in those various APIs? Exactly. Literally, it's just a configuration
thing. It's almost like a tick box essentially, you know, it's so easy. And so that is absolutely one of, I think, the key driving features of that library, that you can just pick whichever one suits your purpose or your platform, you know, depending on what's available on the platform that you're building for. Because currently there aren't TensorFlow versions for every platform on every version of Python and all that kind of stuff. Right. Okay. Well, that's,
that's pretty cool. So there's two interesting things about this library. One is the fact that it does deep learning. So maybe tell people about what deep learning is. How does that relate to like standard neural networks or other types of machine learning stuff? Well, I think the sort of the simplest way to put it is the idea of like adding these additional layers to your network to create a more sophisticated model. So that allows you to create things that can take
more sophisticated feature domains and then map those to an output more reliably. So, and that's where you've seen a lot of advances, for instance, like in like a lot of the image recognition stuff that leverages deep learning to be really, really good at identifying images or even doing things like style transfer on images where you have a photograph of some scene and then you have some other photograph
and you're like, I want to transfer the style of the evening to my daytime photograph. And it will just do it and it looks like pretty normal. And those are like the most, I guess, popular, common, deep learning examples that you see cited. Yeah, it makes a lot of sense. And you know, it's, it's easy to think of these as being like, I know, Snapchatty, like, sort of superfluous type of examples. But you know, machine learning,
doing them, like, you know, putting the little cat face on or switching faces or whatever. But, you know, there's real meaningful things that can come out of this. Like, for example, the detection of tumors in radiology scans, and things like that. And these deep learning models can do the image recognition on that and go, yep, that's cancer, you know, maybe better than even radiologists can already. And then in the future, it's gonna get crazy.
Exactly. And it's funny you mention that. Stanford Medical about a month ago, month and a half ago, actually released, I don't know how many, like 500,000 radiology scans that are annotated and ready for training machine learning. So that exact use case is intended to be a problem for deep learning to be applied to. And there are all kinds of additional datasets like these that are coming out. I just saw a post this week about a deep learning model
that was measuring heart monitor data and being more effective than cardiologists, that kind of thing. So, it's really crazy. You think of this AI and automation disrupting low end jobs, right? Like, at McDonald's, we might have robots making our hamburgers or something silly like that. But if they start cutting into radiology and cardiology, that's gonna be a big deal.
It absolutely is gonna be a big deal. I think people probably need to start thinking about it. I don't think it's necessarily a complete replacement thing. You know, the radiologist AI can't talk to you yet, I guess, at least wait till we get to NLTK, but it can definitely augment and lighten the load on professions like medicine that are, you know, perpetually overworked and allow them to be more
effective, you know, human doctors. So I think like as tools, these things are going to be absolutely incredibly revolutionary. Yeah, it's gonna be amazing. You know, do you want a second opinion? Let's ask, let's ask the super machine. Exactly. But I mean, it's able to one of the strengths of all these machine learning models is that the machine learning models are able to visualize higher dimensional complex data sets
in ways that like humans can't really do. And they have like just intense focus, I guess, right? These models, whereas it might be, it's pretty hard for a doctor to read every single paper ever written on subject X or to look at 500,000 radiology images even across the course of their career. So pretty optimistic where this goes, it's going to be interesting to join all this stuff together. The other thing that we're just starting to touch on here, and it's going to appear in a bunch of
these others, so it's maybe worth spending a moment on as well, is that Keras lets you basically seamlessly switch between CPU computation and GPU computation. So maybe not everyone knows the power of non-visual GPU programming. Maybe talk about that a bit. For sure. So your GPU is a graphics processing unit. So, you know, if you have a gaming PC at home, and you have like an Nvidia graphics card or an ATI graphics card... Can run the Unreal Engine like crazy or whatever, right?
Oh, exactly. So if you play games, and you have a dedicated graphics card, well, even without a dedicated graphics card, you have a GPU, and there's this thing called general purpose GPU programming. So a GPU is a highly parallel computer that has like 1000 cores in it, or whatever, some huge number of cores. Yeah, like one to four or five thousand cores per GPU, right?
Exactly. Yeah. And so the intention there was originally that it needs to process, in parallel, every pixel or every polygon that's going on the screen, right, and perform effects. So that's why you can get blur and all this kind of stuff in real time, and real time lighting and all that kind of stuff. So it processes all that stuff in parallel. But then people started to develop SDKs that let you say, well, in addition to
doing graphics programming, we can just run regular programs on these things. And they're really, really fast at doing math. So we can do that. And so now, basically, a lot of these libraries support GPU processing, and it's literally just like a compile flag. Now it's getting a lot easier. You know, you still have to make sure you have the drivers and that you have a GPU that's reasonably
powerful, especially if you're doing a lot of computation. And so then you can basically run these giant ML models on your GPU. And again, it's something that's pretty well suited to being parallelized. So that is a really great use of the GPU. And that's why you're seeing it take off, because these models are easily made parallel. Yeah, they're what are called embarrassingly parallel
algorithms, right? And you just throw them at these things with 4000 cores and let them go crazy. Yeah, in the early days, I mean, still, I guess, when you're doing DirectX or OpenGL, or these things, it's really all about, I want to rotate the screen. So that's like a matrix multiplication against all of the vectors. And it's really similar, actually, the type of work it has to do.
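Editor's sketch: the "rotate the screen" remark describes one matrix multiplication applied to every vertex at once. The example below shows that on the CPU with NumPy; the helper name `rotation_matrix` and the vertex data are hypothetical, and a GPU would run the same arithmetic across many cores rather than through NumPy.

```python
import numpy as np

def rotation_matrix(theta):
    """Return the 2-D rotation matrix for an angle theta in radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s],
                     [s, c]])

# Three illustrative vertices, one per row.
vertices = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0]])

# Rotate every vertex by 90 degrees in a single matrix multiplication.
# Each row is independent of the others, which is what makes the work
# easy to spread across many cores.
R = rotation_matrix(np.pi / 2)
rotated = vertices @ R.T
print(np.round(rotated, 6))
```

A neural network layer has the same shape of computation: a batch of input rows multiplied by a weight matrix, which is why graphics hardware suits it so well.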
The other thing, I guess, which I don't see appearing anywhere in here, but I suspect TensorFlow may have something to do with it, is the new stuff coming from Google, where they're going beyond GPUs to AI-focused chips. Did you hear about this? Yes. So Google has a thing called a TPU, which is a tensor processing unit or whatever. And that's like a cloud hosted, special piece of hardware that's optimized for doing TensorFlow.
And so I don't know the exact benchmarks in terms of how that compares to, you know, like some gigantic GPU assembly. But obviously, Google thinks that this is a worthwhile investment to build these sort of
hardware racks in the cloud, and then give people access to run their models on there. So I think you're probably going to see more and more specialized, ML-targeted hardware coming out. I don't know whether it'll be consumer hardware, like something you can go and buy for your home computer, but especially in the cloud, you definitely will. Yeah, definitely in the cloud. Yeah, it's very interesting. They were talking about real time
training, not just real time answers. So that sounds pretty crazy. This portion of Talk Python To Me has been brought to you by DataCamp. They're calling all data scientists and data science educators. DataCamp is building the future of online data science education. They have over 1.5 million learners from around the world who have completed 70 million DataCamp exercises to date. Learners get real hands-on experience by completing self-paced, interactive
data science courses right in the browser. The best part is these courses are taught by top data science experts from companies like Anaconda and Kaggle and universities like Caltech and NYU. If you're a data science junkie with a passion for teaching, then you too can build data science courses for the masses and supplement or even replace your income while you're at it. For more information on becoming an instructor, just go to datacamp.com slash create and let them know that Michael sent you.
So speaking of popular libraries and TPUs, the next up is TensorFlow. That originally came from Google and it is crazy at 64,000 stars and 31,000 forks. So tell us about TensorFlow. So TensorFlow, obviously, yeah, this is Google's machine learning library and it sits at a slightly lower level than something like Keras, which obviously uses it as a backend. You can use it directly as well. And what it does is it represents your model as a computation graph.
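Editor's sketch: the computation-graph idea can be shown without TensorFlow at all. The toy below is plain Python and is not TensorFlow's API; the `Node` class and its operation names are hypothetical, chosen only to show a model being described as a graph first and evaluated with data afterwards.

```python
class Node:
    """One node in a toy computation graph (hypothetical, not TensorFlow)."""

    def __init__(self, op, inputs=(), value=None):
        self.op = op          # "input", "const", "add" or "mul"
        self.inputs = inputs  # the nodes this operation depends on
        self.value = value    # only used by "const" nodes

    def evaluate(self, feed):
        """Compute this node's value; `feed` maps input nodes to numbers."""
        if self.op == "input":
            return feed[self]
        if self.op == "const":
            return self.value
        left, right = (node.evaluate(feed) for node in self.inputs)
        if self.op == "add":
            return left + right
        if self.op == "mul":
            return left * right
        raise ValueError(f"unknown operation: {self.op}")

# Describe y = w * x + b as a graph. Nothing is computed yet.
x = Node("input")
w = Node("const", value=3.0)
b = Node("const", value=1.0)
y = Node("add", (Node("mul", (w, x)), b))

# Only now is data fed in and the graph evaluated.
result = y.evaluate({x: 2.0})
print(result)
```

Because the whole model exists as a data structure before any numbers flow through it, a real library is free to optimize the graph or place its operations on a CPU or GPU.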
So that's effectively a graph where the nodes are like operations. And this is a way that they found is really, really effective to represent these models. And it's a little bit more intimidating to get started with mostly because you have to think about building this graph, but you can use it directly
in Python. Python is actually the recommended language and workflow from Google. So for example, you know, when I rewrote the Go version of our little game there, I still had to train and export my model from Python. So I use Python to build that, export it. So that's the sort of recommended workflow currently from Google for many languages is to use Python as the primary language binding. Yeah, that's, that's really interesting and great to see Python. Python appears in so many of these,
these libraries as a primary way to do it. So there's some interesting stuff about this one. Obviously it's super popular. Google has so many use cases for machine learning, just up and down their whole, you know, everything that they're doing. So having this like developed internally is really cool. It has a flexible architecture that lets it run on CPUs or GPUs, obviously, or mobile devices. And it even lets it run like on multiple GPUs and multiple CPUs. Do you have to do anything to make
that happen? Or do you know how it does that? As far as I can tell, especially for switching between CPU and GPU, it's essentially a compile flag. So when you build the libraries or download one of the nightly builds or whatever, you have to get one of the versions that has GPU support enabled and built in. And I think that there are also now increasingly CPU optimizations in there. So for instance, Intel is doing hand
optimized math kernel stuff that's integrated directly into TensorFlow to make it even faster. So that that's something that you can also get in like the latest version as well. So I definitely think speed and performance and making that stuff easily accessible to depending on what your hardware is and where you're going to deploy it is a big focus for them. Yeah, that's really cool. So do you think this is running in the Waymo cars, you know, the Google self driving cars?
Yeah, I mean, I don't know for sure, but I'd be almost positive of it, you know, from everything that I've read and people that I've talked to. I mean, Google built this to use themselves. This is the platform for all of their deep learning and machine learning projects. And so I would assume that TensorFlow is powering that and it's running pretty much all of their stuff. Very, very cool. It's probably in Google Photos and some other
things as well. Yeah, Google Translate, all those things. Pretty much all of the projects that Google is running, when you start looking at them, are effectively AI projects. And just recently, Google Translate, which uses machine learning and statistical models to do the translations, is approaching human-level accuracy for translation between a lot of the popular
languages where they have huge, huge data sets to pull from. Yeah, that's crazy. And very cool. So up next, number five is Theano at 6000 stars and 2000 forks. And this one is really kind of similar to TensorFlow, but really low level, right? Yeah, so it is, you know, more low level, and it is very similar to TensorFlow in the sense that it's also a very high-speed math library. And I believe it was originally made by a couple of the guys who then
went on to Google to make TensorFlow. So it predates TensorFlow by a little bit. But it also has, you know, the things that we're talking about here: it has transparent GPU use. And you can do things like symbolic differentiation, and a lot of mathematical operations that you want to be highly, highly performant. So it is actually pretty similar to what TensorFlow does,
and sort of serves a similar purpose. But what you're comfortable with, and what your existing projects are, is probably going to dictate which one you're using. And if you're using something like Keras, then you can just choose this as the back end. And flip the switch,
just flip the switch. And there you go. Yeah, it's cool. It also says it has extensive unit testing and self-verification where it'll detect and diagnose errors, maybe you've set up your model wrong or something like that. That's pretty cool. Yeah, for sure. I mean, all of these libraries are built by super, super smart, accomplished people who are creating things that are, you know, solving a real-world problem for them and really, you know, sort of pushing things
forward. And I actually think it's great that there are so many libraries in this space, because it really is just making it better for everybody. Yeah, the competition is really cool, to see the different ways to do this and probably cross-pollination. Exactly. Yeah. So one of the things you have to do for these models is feed them data. And getting data can be a super messy thing. And the one library that stands out above all the others for taking, transforming,
redoing, cleaning up data is pandas, right? Absolutely. Yeah. Pandas is one of those libraries that, if you're manipulating especially large sets of data and real-world data, is the one that people repeatedly come back to. And yeah, so pandas is, for those that might not know, a data munging, data analysis library that lets you transform your data. One of the hardest parts when you're doing machine learning is actually getting your data
into a format that can be used effectively by your model. And so a lot of times real-world data is pretty messy, or it might have gaps in it, or it might not actually be formatted in the right units. So it might not be normalized so that you're within the right ranges. And if you feed the models raw data that hasn't really been either cleaned up or formatted correctly, then what you might find is that the model doesn't converge or you get what seems like random results
or things that don't really make sense. And so, you know, spending this time and having a library like pandas that makes manipulating especially very large sets of data very easy is super useful. And even just for instance, when I was doing that little demo that we talked about originally, you know, when I started, I was feeding things raw pixel values for positions
and velocities and stuff. And it just wasn't working. And it wasn't until I really normalized the data and cleaned it up that I started getting good, consistent results. So, you know, dealing with large-scale data sets and being able to manipulate them effectively is super important. Yeah. At the heart of all these successful AI things, these machine learning algorithms and whatnot, is a tremendous amount of data. It's why the companies that we talk about doing well are like
enormous data-sucking machines like Google and Microsoft and some of these other ones. Right. Exactly. And that's where the power of them comes from: you know, Google has access to just a massive amount of data that we regular people don't have access to. Or like we were talking about earlier
with the radiology images, you need a fairly large set of annotated data. And so that's data where, you know, these are case files or whatever that a doctor has already gone through and said, this one was a cancer patient, this one wasn't. And without that kind of annotated data, the models can't really learn. They need to know what the answer is. Right. And so that's really, really important.
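As a rough illustration of the cleanup described above (filling gaps, then normalizing values into a common range), here is a minimal pandas sketch. The column names and numbers are made up for illustration; they are not from the demo discussed on the show, and linear interpolation plus min-max scaling is just one reasonable choice among several.

```python
import pandas as pd

# Hypothetical raw readings: positions in pixels and velocities in
# pixels per second, each with one missing value (a "gap").
raw = pd.DataFrame({
    "position_px": [0.0, 320.0, None, 640.0],
    "velocity_px_s": [-50.0, None, 0.0, 50.0],
})

# Fill the gaps by linear interpolation between neighboring rows.
clean = raw.interpolate()

# Min-max normalize every column so its values fall in the range [0, 1].
normalized = (clean - clean.min()) / (clean.max() - clean.min())

print(normalized)
```

After this step both columns live on the same 0-to-1 scale, which is the kind of input range many models are easier to train on.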
Yeah. We have the whole 10,000-hours-to-become-an-expert thing for humans. That's kind of the equivalent for machines. Yeah, I guess. I don't know what the number is. The machines might need more. That's one of the things that is really interesting about humans, that our neural networks can learn remarkably quickly without having to walk into traffic 1,000 times or do something like that. And so, I don't know, there's some magic going on there or something.
Yeah, there sure is. All right. Next up is Caffe and Caffe2. And this originally started out as a vision project, right? That's right. Yeah, Berkeley. And so this was primarily a vision project. And then there's a sort of successor that is backed by Facebook, actually, and is more general purpose and is sort of optimized for web and
mobile deployment. So obviously, you know, if you want to have machine-learning-based apps on your phone, then having a library that targets that is pretty important. Yeah, I'm sure we're going to see more of that. I mean, there are even rumors, maybe actually from today, and I don't know how trustworthy they are, that the next iPhone will have a built-in AI chip. I remember that. So Apple actually just announced a machine learning SDK, Core ML, at
WWDC in June. And so Apple is already targeting these sort of deployed ML models. So, you know, in that library's case, you are effectively choosing a pre-made model. So I want image recognition or I want, you know, language parsing in my app. And then you can just feed these pre-trained models. But it wouldn't surprise me; you know, they've got, what was it, the motion chip in your iPhone now. Yeah, they got the motion chip. Yeah.
So it wouldn't surprise me at all to start seeing phones deploying AI chips in there to assist with this, because things like Siri are machine-learning-based. Right. So yeah. Yeah. And it doesn't make sense to go to the cloud all the time. That's one of the super annoying things about Siri: you ask it a question and it's like six seconds later. Like you
ask it something simple like what time is it? 10 seconds later, it'll tell you it's such and such. Like, is it really that hard? Yeah. Yeah. It's got to go all the way to the cloud and you're in some sketchy network area or something. Right. Exactly. And so I wouldn't be surprised to start seeing
that stuff deployed onto mobile. I think even at Build, Microsoft's conference, they started talking about edge machine learning, where the machine learning is getting pushed to all these IoT devices that they're working on as well. So a lot of attempts in this area. For sure. Yeah. And that's the next big thing, right? Having IoT-based machine learning devices. Like, can your fridge learn your grocery consumption habits and, you know,
tell you that you're going to run out of milk in two days and you're going to the store today, so maybe you should pick some up. I mean, it's kind of crazy, but it totally will happen. And yeah. Yeah. I mean, it doesn't sound as crazy as let's just let a car go drive in a busy
city on its own. That's true. And yet that's something that exists now, right? Like that's a thing. And maybe it's not fully autonomous, but I mean, you could go and buy one tomorrow; you could buy a car that you can turn on Autopilot and, like, it's crazy. It'll drive for you. So the future is now, the future is here. It's just not
evenly distributed. This portion of Talk Python is brought to you by us. As many of you know, I have a growing set of courses to help you go from Python beginner to novice to Python expert. And there are many more courses in the works. So please consider Talk Python Training for you and your team's training needs. If you're just getting started, I've built a course to teach you Python the way professional developers learn, by building applications. Check out my Python Jumpstart by
Building 10 Apps at talkpython.fm/course. Are you looking to start adding services to your app? Try my brand new Consuming HTTP Services in Python. You'll learn to work with RESTful HTTP services, as well as SOAP, JSON and XML data formats. Do you want to launch an online business? Well, Matt Makai and I built an entrepreneur's playbook with Python for Entrepreneurs. This 16-hour course will teach you everything you need to launch your web-based business with Python. And finally,
there are a couple of new course announcements coming really soon. So if you don't already have an account, be sure to create one at training.talkpython.fm to get notified. And for all of you who have bought my courses, thank you so much. It really, really helps support the show. One little fact or quote from the Caffe webpage that I want to just throw out there, because I thought it was pretty cool, before we move
on. They say, speed makes Caffe perfect for research experiments and industry deployments. It can process 60 million images per day on a single GPU. That's one millisecond per image for inference and four milliseconds per image for learning. That's insane. So fast. And 60 million images per day is just crazy. And that's why we were talking about the data just a minute ago. And the amount of data being poured into these models is just
staggering every day. And I don't doubt that people are feeding these models that much data every day. And I think they were saying 90% of the world's data that's ever been created has been created in the last year. And so it's just one of these things where it accelerates and accelerates and builds on all this stuff. So I think these things are just going to get faster until they're effectively real time.
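For a sense of scale, here is a quick back-of-the-envelope check of the figures quoted above. It only uses the numbers read out on the show (60 million images per day, 1 ms per image for inference, 4 ms for learning) and assumes the GPU runs around the clock with no other overhead.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

# Figure quoted on the show: 60 million images per day on a single GPU.
images_per_day = 60_000_000
ms_per_image = SECONDS_PER_DAY / images_per_day * 1000

# Going the other way: how many images fit in a day at the quoted speeds?
inference_per_day = SECONDS_PER_DAY * 1000 / 1  # 1 ms per image
learning_per_day = SECONDS_PER_DAY * 1000 / 4   # 4 ms per image

print(f"{ms_per_image:.2f} ms per image at 60M images/day")
print(f"{inference_per_day:,.0f} images/day at 1 ms each")
print(f"{learning_per_day:,.0f} images/day at 4 ms each")
```

Sixty million images a day works out to about 1.44 ms per image, so the daily figure sits close to the quoted inference speed rather than the learning speed.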
Yeah, absolutely. All right. I don't think we said the stars for that one: 20,000 stars and 11,000 forks. So up next is definitely one that data scientists in general just live on. And that's Jupyter. For sure. And so this has just become the standard interchange format for sharing data science, whether it's papers or data sets or models. This has just become the sort of standard,
I don't know what you're going to call it, lingua franca for exchanging this data. And it's effectively a tool for the thing called a Jupyter notebook, which is kind of like a web page with embedded programs and embedded data sets. I think that's probably a good way to describe it for those who might not have used it before.
Right. Instead of writing a blog post or a paper that's got a little bit of code, then a little bit of description, then a picture, which is a graph, it's live and you can re-execute it and tweak it. And it probably plugs into many of these other libraries and it's using them somewhere behind the scenes to do that.
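To make the "description plus live code" idea concrete: a notebook file is just JSON holding a list of cells. The sketch below builds a minimal two-cell notebook by hand, following the version 4 notebook layout; the cell contents are invented for illustration, and in practice you would let Jupyter or the `nbformat` package write this for you.

```python
import json

# A minimal notebook in the version 4 layout: one prose cell, one code cell.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# Crop growth\n", "A short description of the data."],
        },
        {
            "cell_type": "code",
            "execution_count": None,  # not executed yet
            "metadata": {},
            "outputs": [],            # filled in when the cell is run
            "source": ["heights = [1.0, 1.4, 2.1]\n", "sum(heights) / len(heights)"],
        },
    ],
}

# Serialize to the text that would be stored in an .ipynb file, then read it back.
text = json.dumps(notebook, indent=1)
cell_types = [cell["cell_type"] for cell in json.loads(text)["cells"]]
print(cell_types)
```

Because the prose, the code, and (once run) the outputs all travel in one file, the person you share it with can re-execute and tweak it.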
Exactly. Yeah. It's built on the IPython kernel, that's the interactive Python kernel. Yeah. I'm sure that there are all kinds of specific uses that can run that notebook code and use that stuff there. Cool. Next up is maybe one of the newer kids on the block in this deep learning story, from Microsoft, actually: their Cognitive Toolkit, CNTK. Yeah. And they just released, I think, the 2.0 version of it at the beginning of June or late May.
And, you know, now it's open source and it's got the Python bindings. You know, Microsoft's been doing a lot of open source work lately and they've been really pushing a lot of their own projects. And like we said earlier, it's available as a backend for Keras. So it's similar again to TensorFlow and Theano in that it's again focused on that sort of low-level
computation as a directed graph. So, similar model; I think this is obviously emerging as a popular and efficient way to represent machine learning models, using that directed graph. So it's pretty popular too, right? It's got a decent number of stars and forks, and obviously as a Keras backend and Microsoft-backed library, it's going to be pretty popular and pretty common out there. Yeah, absolutely. These days, you know, with Satya Nadella and a lot of the changes at Microsoft,
I feel like this open source stuff is really taking a new direction, a positive one. And also I think their philosophy is if it's good for Azure, it's good for Microsoft. And so this plugs into their hosted stuff in interesting ways. And they've got a lot of cognitive cloud services and things like that.
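The "computation as a directed graph" idea that comes up for TensorFlow, Theano, and CNTK can be sketched in a few lines of plain Python. This is a toy, not how any of those libraries is implemented: the `Node` class and `backward` function are made up for illustration, and it does reverse-mode automatic differentiation on plain numbers rather than the symbolic differentiation Theano performs on expressions.

```python
class Node:
    """One vertex in a directed computation graph."""

    def __init__(self, value, parents=()):
        self.value = value      # result of the forward computation
        self.parents = parents  # pairs of (parent_node, local_gradient)
        self.grad = 0.0         # d(output)/d(this node), set by backward()

    def __add__(self, other):
        return Node(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Node(self.value * other.value,
                    ((self, other.value), (other, self.value)))


def backward(output):
    """Propagate gradients from the output node back to every input."""
    order, seen = [], set()

    def visit(node):
        # Depth-first walk so that parents appear before their consumers.
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0
    # Walk consumers first so each node's gradient is complete before use.
    for node in reversed(order):
        for parent, local_grad in node.parents:
            parent.grad += node.grad * local_grad


x = Node(3.0)
y = Node(4.0)
z = x * y + x  # builds a small graph: z = x*y + x
backward(z)
print(z.value, x.grad, y.grad)
```

Here `z` evaluates to 15, with gradients 5 for `x` (that is, y + 1) and 3 for `y` (that is, x). Describing the model as a graph first is what lets a real library work out gradients and decide where to run each operation.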
Yeah. Azure is becoming pretty huge. It's starting to rival maybe even AWS for, you know, a lot of these cloud-hosted services, and especially around machine learning, Azure has so many different machine learning tools available. And it's really clearly a pretty big focus for Microsoft. And again, it's great to see, you know, more of the sort of big guns being
more open about their development and sharing. I mean, it drives everybody forward and, you know, just accelerates development across the whole ecosystem. Yeah. And they have a number of the Python core developers there. They have Brett Cannon, they have Steve Dower, they have, you know, Dino Viehland; there are some serious people back there working on the Python part.
Exactly. Yeah. They've got a lot of the Python core team there. And I know a bunch of the guys from ActiveState were just at PyData in Seattle and, you know, a huge number of the core team were there and, you know, just a really, really great little conference. They're talking about Python and data science. Yeah. I think they have some really interesting language stuff as well.
So speaking of languages, certainly the longest running one, probably, that's really still going strong is NLTK with 5,000 stars and 1,500 forks.
Yeah. And so NLTK is the Natural Language Toolkit. And, you know, obviously this is a thing for doing natural language parsing, which is, I guess, one of the holy grails of machine learning: to get it so you can just speak to your computer in completely natural language, and maybe even give it instructions in natural language, and have it be able to follow your directions and understand what you're
asking. And so this is a really popular one in academia for research. They link to and include massive corpora of work. So that's gigantic bodies of text in different languages and in different styles to be able to train models. So there's also a pretty large open data component to this project as well. And obviously, you know, the use case here for natural language is huge: for translation, like we mentioned earlier,
chatbots, which are now a huge thing for support. I mean, every website you go onto, it pops up, Hey, I'm, you know, Bob, can I help you today? And it's not really a person. It's just a chatbot. And, you know, there's just so many. And then, like we were saying, Siri and Cortana and all those sort of personal assistants where you can ask it a natural language question and it can come back to you. So this is the sort of almost foundational library, still going
strong, still tons of active development and research going on with this. Yeah. It's really cool. And especially with all the smart home speaker things, Google Home, HomePod, all that stuff, this is going faster, not slower, in terms of acceleration, right? We're talking more and interacting with them way more. Definitely the chatbots. And anytime you have text and you want a computer to understand it, this is like a first step for tokenization,
stemming, tagging, parsing, semantic analysis, all that kind of stuff. Right? Yeah. And that's exactly what it outputs. So what it will do is generate parse trees and stem it all out, and then you use that tokenized version to train your model, not raw text characters. And we really are getting there. I mean, these days, for sure, just the recognition part, you know, the tokenization part, is very, very good. It's more the kind of
semantic meaning. What do you mean when you ask Siri, what are the movie times for X or something like that? How specific do you have to be to get a reasonable answer from her? Yeah. It's got to go speech to text and then it probably hits something like this. Exactly. Yeah, exactly. That's going to hit a library like this, and we're getting there. It's not quite at the Star Trek "computer, do this for me," but it's way closer than I kind of ever thought we would
be. It's really pretty impressive sometimes. Yeah, absolutely. It's fun to see this stuff evolve. Absolutely. All right, Pete, that's the 10 libraries. And I think these are all really great selections and hopefully people have got a lot of exposure and maybe learned about some they didn't know about. And I guess I'd encourage everyone to go out there and try these out, and if you've got an idea, play with it with one of them or more. For sure. They're all so accessible
now. You know, you don't necessarily have to be an ML researcher or a math wizard to actually create something that's interesting or experiment or learn a little bit. These libraries all do a really, really great job of abstracting away some of the more complicated mathematical parts and, you know, in the case of a lot of them, making it reasonably accessible. And so that's where I think you're seeing this kind of democratization trend in machine learning now where this stuff is
becoming more accessible. It's becoming easier. And I think you're going to see a lot of creativity and a lot of innovation come out of people if they just sort of give it a shot and try something out and, you know, learn something new. Yeah, that's awesome. I totally agree with the democratization of it. And that's also happening from a computational perspective, right? Like these are easier to use, but also with the GPUs
and the cloud and things like that, it's a lot easier. You don't need a supercomputer. You need 500 bucks or something for a GPU. Exactly. I think all of these things feed into that together, where you have a democratization trend in the tools and the source code, so that now you can have access to Google's
years and years of AI research via TensorFlow on GitHub. You also, like you said, can go and buy a $500 GPU and have basically a supercomputer on your desktop. But there's also this open data component where you can get access to massive data sets like the Stanford image library and, you know, these huge NLTK language corpora that you can then use to train your models, where previously that was probably impossible to actually access.
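Before any of that corpus text reaches a model, it goes through the tokenizing and stemming steps mentioned earlier in the NLTK discussion. The sketch below shows the idea using only the standard library. The `tokenize` and `toy_stem` helpers are hypothetical and deliberately crude; they are not NLTK's algorithms. NLTK itself ships real tokenizers and stemmers, such as `word_tokenize` and `PorterStemmer`.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens (a crude stand-in for a real tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def toy_stem(token):
    """Strip a few common English suffixes; far cruder than a real stemmer."""
    for suffix in ("ings", "ing", "s"):
        # Only strip when at least three letters of the word would remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


sentence = "What are the movie times for the late showings?"
tokens = tokenize(sentence)
stems = [toy_stem(token) for token in tokens]

print(tokens)
print(stems)
```

The model is then trained on the normalized tokens (for example "time" and "show" rather than "times" and "showings"), not on the raw characters.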
Yeah, that's a really good point because even though you have the machines and you have the algorithms, the data really makes it work. All right. So I think let's leave it there for the libraries. Those were great. And I'll hit you with the final two questions. You're going to write some Python code. What editor do you open up? Well, obviously ActiveState has Komodo. So I tend to use that a lot for doing Python code, but I've also,
to be totally fair, used VS Code as well, which is getting increasingly popular. So I tend to like to cycle between them all because we have an editor product. And so, you know, it's great to keep up to date on what all the other ones are doing. So I tend to cycle around a little bit, but yeah, Komodo is sort of my go-to.
Yeah, that's cool. Yeah. It's definitely important to look and see what the trends are, what other people are doing, how can you bring this cool idea back into Komodo, things like that, right? Yeah, for sure. Yeah. All right. And I think we've already hit 10, but do you have another notable PyPI package?
I don't know. There are so many. I would again probably give a little bit of a shout out, you know, since we're talking about machine learning, to Keras, because I do think as an entry point to machine learning, it's so accessible. It's so easy to at least get started and get a result with. I would give a little shout out to that; I think that if you're looking to get into this and you're looking to try it out, that's a really great place to start.
Yeah, I totally agree with you. That's where I would start as well. All right. Well, it's very interesting to talk about all these libraries with you. I really appreciate you coming on the show and sharing this with everyone. Thanks for being here. Thank you for having me. You bet. Bye. This has been another episode of Talk Python To Me. Our guest has been Pete Garcin, and this episode has been brought to you by DataCamp and us right here at Talk Python Training.
Want to share your data science experience and passion? Visit datacamp.com slash create and write a course for a million budding data scientists. Are you or a colleague trying to learn Python? Have you tried books and videos that just left you bored by covering topics point by point? Well, check out my online course, Python Jumpstart by Building 10 Apps at talkpython.fm/course to experience a more engaging way to
learn Python. And if you're looking for something a little more advanced, try my Write Pythonic Code course at talkpython.fm/pythonic. Be sure to subscribe to the show. Open your favorite podcatcher and search for Python. We should be right at the top. You can also find the iTunes feed at /itunes, Google Play feed at /play, and direct RSS feed at /rss on talkpython.fm. This is your host, Michael Kennedy. Thanks so much for listening. I really appreciate it.
Now get out there and write some Python code.
