#207: Parallelizing computation with Dask

00:00

What if you could write standard NumPy and Pandas code, but have it run on a distributed computing grid for incredible parallel processing right from Python? How about just splitting the work across multi-processing to escape the limitations of the GIL right on your local machine? That's what Dask was built to do. On this episode, you'll meet Matthew Rockland to talk about its origins, use cases, and a whole bunch of other interesting topics.

00:21

This is Talk Python To Me, episode 207, recorded February 20th, 2019. Welcome to Talk Python To Me, a weekly podcast on Python, the language, the libraries, the ecosystem, and the personalities. This is your host, Michael Kennedy. Follow me on Twitter where I'm @mkennedy. Keep up with the show and listen to past episodes at talkpython.fm and follow the show on Twitter via at Talk Python.

00:56

Hey folks, a quick announcement before we get to our conversation with Matthew. People have been enjoying the courses over at Talk Python Training for years now, but one shortcoming we've been looking to address is to give you offline access to your content while you're traveling or if you're just in a location with spotty internet speeds. We also want to make taking our courses on your mobile devices the best they can be. That's why I'm thrilled

01:19

to tell you about our new apps, which solve both of these problems at once. Just visit training. talkpython.fm/apps and download the Android version right now. The iOS version is just around the corner. Should be out in a week or so. You'll have access to all of your existing courses. But we've also added a single tap feature to allow you to join our two free courses, the Responder Web Framework mini course and the MongoDB quick start course. So log in,

01:44

tap the free course, and you'll be right in there. Please check out the apps and try one of the free courses today. Now, let's talk Dask with Matthew. Matthew, welcome to Talk Python To Me. Thanks, Michael. I've listened to the show a lot. It's really an honor to be here. Thank you for having me. It's an honor to have you as well. You've done so much cool work in the distributed computing space, in the data science and computation space. It's going to be super fun to dig into all

02:08

that and especially dig into Dask. But before we get to that, of course, let's start with your story. How'd you get into programming in Python? Yeah. So I got into programming originally actually on a TI basic calculator. So I was in math class, plugging along, spent my few thousand hours not being a good programmer there. Picked up some C++ in high school, did some MATLAB and IDL and C# in college, did some engineering jobs,

02:32

did some science. And then in grad school, just tried out Python on a whim. I found it was all of those things at the same time. So it was fun to work in object-oriented mode, like with C#. It was fun to do data analytics, like with MATLAB. It was fun to do some science work that I was doing in astronomy at the time, like with IDL. So it was sort of all things to all people.

02:51

And eventually I was in a computer science department and I found that Python was a good enough language that computer scientists were happy with it, but also usable enough language that actual scientists were happy with it. I think it's probably where Python gets a lot of its excitement, at least on the sort of numeric Python side. I think that's a really nice insight. And it sounds like a cool way you got into programming.

03:12

I feel like one of Python's superpowers is it is kind of what I think of as a full spectrum language. You can make it what you need it to be at the beginning. But like you said, professional developers and scientists can use it as well. And you kind of touched on that, right? Like you can use it like C# or you can use it like MATLAB or you can kind of be what you need it to be. But if you need more, it's not like, well, now you go right in C++. It's just like, we'll now use these other features.

03:38

Yeah, definitely. I'd also say it has a lot of different parts to it. Like there, before there was like the web programming community and it was a community that used it kind of like Bash, like a sysops community. And there was the, you know, scientific community. I know those are all kind of the same and it's really cool to see all those groups come together and use each other's stuff. You see people using scikit-learn inside of a Flask app deployed with some other Python thing.

03:59

Ansible or something, right? Yeah, for sure. And the fact that all those things were around makes it sort of the cool state that it's in today. Yeah, it's just getting better, right? But it was already good. It's just like connecting all the good parts. Exactly. It's the interconnections are building a little bit stronger across these disciplines, which it's kind of rare, right? There's not that many languages and ecosystems where that happens,

04:18

right? Like think of Swift for iOS, right? Swift is a pretty cool language. It has some issues that I could take with it, but it's pretty cool. But it's not also used for data science and for DevOps, right? It's just not. Right. Yeah. That's pretty cool. Also, you said you did some IDL. It sounds like you were doing astronomy before, right?

04:34

Yeah. This is just an undergrad. I was taking some astronomy courses. It was fun though, because I knew how to program and I was just really productive as a scientist because I was fluent in programming. That ability to get more productive was really good. It took me away from science and towards computer science, but in service of science. So that was my experience. Now you're building these platforms where legitimate world-class science is happening. Yeah. No, it's cool.

04:59

Right? You must be really proud, right? To think of what people are doing with some of the software you're building and you're collaborating on. Yeah, the software that all of us build. And yeah, it's definitely very cool. It's very satisfying to have that kind of impact, not just in one domain, but across domains. And I get to work with lots of different kinds of people every day that are solving, I think, high impact problems. And yeah, that's definitely a strong motivator for me.

05:20

Yeah, I'm sure it is. Programming is fun, but it's really not fun to work on a project and have nobody use it. But to see it thrive and being used, it really is a gratifying experience. Yeah. I'll go so far as to that. Programming is no longer fun. It's just the way we get things done and seeing them done, as you're saying, is really where a lot of the satisfaction comes from. Yeah, that's cool. So how about today? What are you up to these days?

05:42

You were, to put it past tense, you were doing a lot of stuff with Anaconda Inc., previously named Continuum, but same company, but recently you've made a change, right? Yeah. So I like to say it's the same job, just a different employer. Yeah. So I guess my job today is a few things. I work in the open source numeric Python ecosystem. So this is usually called like SciPy or PyData. And I mostly think about how that system scales out to work with

06:10

multi-core machines or with distributed clusters. I usually do that with a project called Dask. That's maybe like the flagship project that I help maintain, but I'm just generally active throughout the ecosystem. So I talk to the NumPy developers, the Pandas developers, scikit-learn and Jupyter developers. I think about file formats and protocols and conventions. There are a few of us- Networking. Networking. Yeah. Lots of things. So generally in service

06:31

of those science use cases or in business use cases, there's tons of things that go wrong. And I tend to think about whatever it is needs to get handled. But generally I tend to focus on Dask, which I suspect we'll talk about more today. Yeah. A tiny bit. So speaking of Dask, let's set the stage, talk a little bit about, you know, why does Dask exist and so on. And I guess maybe start with a

06:53

statement that you made at PyCon, which I thought was pretty interesting. And you said that Python has a strong analytical computation landscape and it's really, really powerful, like Pandas and NumPy and so on. But it's mostly focused on single core computation and importantly, data that fits into RAM. You want to elaborate on that a little? Sure. So everything you just said, so all of those libraries, so NumPy and Pandas and scikit-learn are loved today by data scientists because the APIs are,

07:23

they do everything they want them to do. They seem to be well tailored to their use cases. And they're also decently fast, right? So they're not written in pure Python. They're written in C, Fortran, C++, Cython, LLVM, whatever. Right. And so much of the interaction is like, I have the data in Python. I just pass it off. Or maybe I don't even have it in Python. I have like a NumPy array, which is a pointer to a data down in C level, which I then pass that off to another

07:46

part. And then internally down in C, it cranks on it and gives it back like five or something, right? Like, so it's not, it's a whole, I think the whole story around performance and is Python slow, is Python fast? Like it gets really interesting really quick because of those types of things. Yeah. So we're, we have really nice usable Python APIs, which are kind of like the front end to a data scientist. And that's hitting a backend, which is C++ Fortran code, which is very fast.

08:13

And that combination is really useful. Those libraries also have like now decades of expertise, PhD theses thrown into them, just tons of work to make them fit. And people seem to like them a lot. And it's also not just those packages. There's hundreds of packages on top of NumPy Pandas scikit-learn, which do biology or do signals processing or do time series forecasting or do whatever. And there's this ecosystem of thousands of packages, but that code, that code that's very

08:41

powerful, it's really just powerful on a single core, usually on data that fits in RAM. So as sort of big data comes up or as these clusters become more available, the whole PyData SciPy stack starts to look more and more antiquated. And so the goal that I was working on for a number of years and continue to work on is how do we elevate all of those packages, you know, NumPyPanda, scikit-learn, but also everything that depends on them to run well on distributed machines,

09:04

to run well on new architectures, to run well on large data sets? How do we sort of refactor an ecosystem? That's a really hard problem.

09:11

You know, that's a super hard problem. And it's, I think, one of the challenges that PyPy, P-Y-P-Y, the JIT compiled alternative to CPython has such, I think that would have really taken off had it not been for, yes, we can do it this other way, but you have to lose NumPy, or you have to lose this C library, or you lose the C speedups for SQLAlchemy, or whatever, right?

09:36

You're trying to balance this problem of like, I want to make this distributed and fast and parallel, but it has to come along as is, right? Right. Yeah, we're not going to rewrite the ecosystem. That's not a feasible project today. Right. I mean, theoretically, you could write some alternate NumPy, but people trust NumPy. And like you said, it's got so much polish, it's insane to try to recreate it.

09:57

Yeah. It also has, even if you recreated it, there's a lot of weight on it. A lot of other packages depend on NumPy in very arcane ways, and you would have a hard time reproducing all of those. Right. Yeah. It's like boiling the ocean, or trying to boil the ocean, pretty much. Okay. And then, so that's the problem is like, we have problems that are too big for one machine. We have great libraries that we can run on one machine. How do you take that ecosystem and elevate it in a way

10:26

that's roughly the same API, right? We had certain things like that, right? Like Spark, for example, is one where you can do kind of distributed computation, and there's Python interactions with Spark, right? Yeah, definitely. But Spark is again, going down that rewrite path, right? Spark is not using NumPy or Pandas. They're rewriting everything in Scala, and that's just a huge undertaking. Now, they've done a great job. Spark is a fantastic project. If Spark works for you, you should use

10:50

it. But if you're used to Pandas, it's awkward to move over. Most good Python Pandas devs that I know who moved to Spark eventually move to Scala. They eventually give up the ecosystem, they move over. And that makes sense, given how things work. Right. But if you wanted to say... Because if you're going to work there, you might as well be native in that space, right?

11:05

But if you wanted to say use NumPy, there is no Spark NumPy, nor is there a Spark SciPy, nor is there a Spark time series prediction. Everything built on it. Yeah. So Spark really handled sort of the Pandas use case to a certain extent, and like a little bit of the scikit-learn use case. But there's this other sort of 90% ecosystem that's not handled. And that's where Dask tends to be a bit more shiny.

11:26

Right. And isn't... I don't do much with Spark, so you'll have to keep me honest here. But isn't Spark very like MapReduce-y? It has a few types of operations and pipelines you can build, but you can't just run arbitrary code on it, right? Yes and no. So yes, underneath the hood, from a distributed assistance point of view, Spark implements sort of the same abstraction as MapReduce, but it's much more efficient and it's

11:49

much more usable. The APIs around it are much nicer. So if you do something like a group bi-aggregation, that is sort of a mapping and then a production under the hood. And so as long as you sort of stay within the set of algorithms that you can represent well with MapReduce, Spark is a good fit. Not everything in the sort of the PyData SciPy stack looks like that, it turns out. And so we looked at

12:08

Spark early on and, hey, can we sort of use the underlying system to paralyze out PyData? The answer came back pretty quickly, no. It's just not flexible enough. Yeah. And you know, it's, I love that this format is audio, but for this little part, I wish we could show some of the pictures that you've had and some of your presentations and stuff, because the graphs of how all the data flows together is super, super interesting and complicated and sort of self-referential and all sorts of stuff.

12:33

Yeah, no, those visuals are really nice and it's nice that they're built in. You get a good sense of what's going on. Yeah, it's cool. All right. So let's talk about Dask. You've mentioned it a couple of times. Dask is a distributed computation framework focused on parallelizing and distributing the data science stack of Python, right? Yeah. So let me describe it maybe from a 10,000 foot view first, and then I'll dive in a little bit.

12:57

So from a 10,000 foot view, the tagline is that Dask scales PyData. Dask was designed to parallelize the existing ecosystem of libraries like NumPy, Pandas, and scikit-learn, either on a single machine, scaling out from memory to disk using all of your cores, or across a cluster of machines. So what that means is if you like Pandas, you like NumPy, and you want to scale those up to many terabytes of data, you should maybe investigate Dask. So diving in a little bit, Dask is kind of

13:22

two different levels. I'll describe it Dask at a low level first, and then a high level. So at a low level, Dask is a generic Python library for parallel computing. Just think of it like the threading module or like multiprocessing, but just more. There's more sophistication in there, it's more more complex. It handles things like parallelism, like task scheduling, load balancing, deployment, resilience, node fallover, that kind of stuff. Now using that very general purpose library,

13:47

we've built up a bunch of libraries that look a lot like PyData equivalents. So if you sort of take Dask and NumPy and smash them together, you get DaskArray, which is a large scalable array implementation. If you take Dask and Pandas and squash them together, you get DasDataFrame, which uses many PandasDataFrames across a cluster, but then can sort of give you the same Pandas API. So it looks and feels a lot like underlying sort of normal libraries, NumPyrePandas, but it scales

14:14

out. So today when people talk about Dask, they often mean DasDataFrame or DaskArray. Yeah. So just to give people a sense, like this, when you interact with one of these, say, DaskArrays, it looks like a NumPyArray. You treat it like a NumPyArray mostly. But what may actually be happening is you might have 10 different machines in a cluster. And on each of those machines, you might have some subset of that data in a true NumPy array, right? And that's,

14:43

they know how to coordinate operations. So if you do like a dot product on that with some other bit, it can figure out the coordination and communication across all the different machines in the cluster to compute that answer in a parallel way. But you think of it as like, well, I just do a dot product on this array and we're good, right? Just like you would in NumPy. That's exactly correct. Yeah. So Dask is like a very efficient secretary. It's doing all the

15:05

coordination work. NumPy is doing all the work on every node. And then we've written a lot of parallel algorithms around NumPy that teach you how to do a big matrix multiply with a lot of small matrix multiplies. Or on the pandas side, you know, a big join with a lot of smaller joins. That sounds like a lot of coordination and a challenging problem to do in the general sense. Yeah, it's super fun.

15:24

The general solution of that. Yeah, I bet it is. A lot of linear algebra and other types of stuff in there, I'm guessing. Sure. Or a lot of pandas work or a lot of machine learning work. Yeah, yeah. Or a lot of other work. So I want to point out here that the lower level Dask library, the thing that just does that coordination, is also used separately from NumPy and pandas. And I'm sure we'll get to that in the future. But so Dask is sort of this core library that handles parallelism.

15:46

You can use it for anything. There's also like a few big uses of it for the major PyTadet libraries. Yeah. We have Dask Array and we have Dask DataFrame and so on. And these are really cool. And we said the API is really similar, but there's one significant difference. And that is that everything is lazy, right? Like a good programmer?

16:07

Sure. So Dask Array and Dask DataFrame do both do lazy operations. And so yeah, so at the end of your, you know, you do DataFrame, you do, you know, read parquet, filter out some rows, do a group aggregation, get out some small results. You then call .compute. At the end of that,

16:22

there's a method on your Das DataFrame object. And that will then ship off the sort of recipe for your computation off to a scheduler, which will then execute that and give you back a result, probably as a pandas DataFrame. There's two major parts here. One is the data structures that you talked about. And the other is the scheduler idea, right? And those are almost two separate independent things that make up Dask. Two separate code bases even. Okay.

16:46

This portion of Talk Python To Me is brought to you by Linode. Are you looking for hosting that's fast, simple, and incredibly affordable? Well, look past that bookstore and check out Linode at talkpython.fm/Linode. That's L-I-N-O-D-E. Plans start at just $5 a month for a dedicated server with a gig of RAM. They have 10 data centers across the globe. So no matter where you are or where your

17:08

users are, there's a data center for you. Whether you want to run a Python web app, host a private Git server, or just a file server, you'll get native SSDs on all the machines, a newly upgraded 200 gigabit network, 24-7 friendly support, even on holidays, and a seven-day money-back guarantee. Need a little help with your infrastructure? They even offer professional services to help you with architecture, migrations, and more. Do you want a dedicated server for free for the next four months?

17:34

Just visit talkpython.fm/Linode. I want to dig into the schedulers because there's a lot of interesting stuff going on there. One thing that I think maybe it might be worth digging into a little bit for folks is this demo that you did at PyCon around working with, basically you had this question you want to answer, it's like, what is the average or median tip that a person gives to a taxicab driver plotted by time of

18:04

day? You know, like Friday evening, Monday morning, and so on. And so there's a huge data set, maybe you can tell people where to get it, but it's the taxicab data set for some year, and it's got like 20 gigs on disk and 60 gigs in memory. Now, I have a MacBook Pro here that's like all knobs to 11, and that would barely hold a half of that data, right? So like even on a high-end computer, it's still like, I need something more to run this because I can't load that into memory.

18:34

Well, to be fair, actually, that's not by like big data standards, it's actually pretty small. People use Dask on like 20 terabytes rather than 20 gigabytes pretty often. But you're right that it's like, it's awkward to run on a single machine. Yeah, you can do a lot of paging. Yeah, it's a pretty easy data set, though, to run in a conference demo, just because it's pretty interpretable. Yeah. Yeah. Everyone can immediately understand average tip by time of day and day of the week.

18:54

Yeah, so it's a good demo. I'll maybe talk through it briefly here, but I encourage people to look at the PyCon. There's probably PyCon 2017, I think, for Dask. Yeah, and I'll put a link to the talk in the show. Yeah, so in that example, we had a bunch of CSV files living on like Amazon S3 or Google Cloud Storage, I don't remember which. Yeah, Google Cloud Storage. And you could use the pandas read CSV function to read a bit of those files, but not the entire

19:17

thing. But if I had to read everything, it would fill up RAM and pandas would just halt. And that's kind of the problem we run into in PyData today when we hit big data. Everything's great to run out of RAM, then you're kind of, you're out of luck. So we sort of switched out the import of pandas for DasDataFrame, called the DasDataFrame read CSV function, which has all of the same keyword arguments of pandas read CSV, which

19:37

if you know pandas are numerous. And sort of reading one file, we sort of give it a glob string. We'd ask it to read all the files. Yeah, we'd like did normal pandas API stuff. I think we filtered out some bad rows or like some free rides in New York City that we had to remove. We made a new column, which is a tip fraction. We then use pandas datetime functionality, I think, to group by the hour of the day.

19:58

And the day of the week. And if you know pandas, like that API is pretty comfortable to you. Just the data's too big. Yeah, it was the same experience, pandas. Like we switched out the import, we put a star in our file name to read all the CSV files. We might have asked for a cluster. So we might have asked something like Kubernetes or Yarn or Slurm or some other cluster manager to give us a bunch of Dask workers. That one's probably using Kubernetes because it's on Google.

20:20

Those showed up. So Google was fine enough to give us virtual machines. We deployed Dask on them. And then, yeah, then we hit compute. And then all of the machines in our cluster went ahead and they probably called little pandas read CSV functions on different ranges in those CSV files coming off of Google Cloud Storage. They then did different group operations. They did some filtering operations on their own. They're probably somewhat related to the ones that we asked them to

20:43

do, but a little bit different. They had to communicate between each other. So those machines had different pandas data frames. They had to share results back and forth between each other. If one of those machines went down, Dask had to ask for a new one or had to recover the data that machine had. And at the very end, we got back at pandas data frame and we plotted it. And I think

21:00

the punchline for that one is that the tips for like three or 4 a.m. are absurdly high. It's like 40% or something like that. Yeah. Did you have any insight why that is? Do people know? I mean, the hypothesis is that it's people coming home from bars. I was actually giving the same demo at a conference once and someone was like, oh, filter out all the short rides and I'll bet it goes away. And so you did that and the spike does go away. And so it's probably just people getting

21:24

$5 cab rides from the bar. And we were looking at tip fraction, which is sort of going to have a lot of noise. But it was fun because this guy said, hey, try this thing out. And he knew how to do it. He'd never seen Dask before. He'd just known pandas before. Gave him the keyboard. He typed in the pandas commands and it just worked. We got a new insight out of it. And so that sort of experience shows us that a tool like Dask data frame gives us the ability to do pandas operations,

21:47

to extend our pandas knowledge, but across a cluster. That data set could have been 20 terabytes and the same thing would have worked just fine. You would have wanted more machines. It would have been fine. Wait, or one more time. One more time. Yeah. So Dask also works well on a single machine just by using small RAM intelligent ways. Oh, it does. Okay. So it can be more efficient. You could give it a ton of data, but it can realize I can't take it all in one big bite. I got to eat

22:09

the elephant sort of thing. Yeah. So if you can run through your computation, a small RAM, Dask can usually find a way to do so. And so that's actually like Dask can be used on clusters. And that's the flashy thing to show to people at a PyCon talk. But most Dask users are using just their laptops. So for a long time, Dask actually just ran on us on single machines for the first year of its life. We built all the parallel algorithms, but didn't have a distributed scheduler. And there people

22:30

just wanted to sort of stream through their data, but with pandas or NumPy APIs. So people were dealing with 100 gigabyte data sets on their laptops pretty comfortably. Oh, that's interesting. I didn't realize that that was a thing it could do. That's cool. Could it be used in the extreme? Like if I've got a really small system with not much RAM at all, I'm thinking like little IoT things. Could it let me process like not big data, but like normal data and like a tiny device?

22:55

Sure. Depending on how much data or how much you're looking for. Sure. Yeah. Okay. Interesting. That's a super cool feature. Now, one of the first things that struck me when you ran this, like obviously you've got all these machines. I think you had 32 systems with two cores each when you ran the demo and it went really quickly, which is cool. I expected it to come up with a graph and you could say, run it, but we can't even run it locally here. It is much,

23:18

you know, in 10 seconds or something. But instead, what I saw when you did that was there's like a beautiful graph that's alive and animated describing like the life of your cluster and the computation and like different colors, meaning different things. Can you describe like that diagnostic style page? And so people know. Yeah, sure. It's hard to describe a picture, much less like an interactive dashboard. But yeah, so maybe I'll describe a little bit of like the architecture

23:44

of Dask first. I'll make this a bit more clear. Okay. So Dask has a bunch of workers that are on different machines that are doing work, holding onto data, communicating with each other. Then there's a centralized scheduler, which is keeping track of all those workers. This is like the, like the foreman at a job site, maybe of all these workers telling the workers what to do,

24:01

telling them to share with each other, et cetera. And that foreman, that scheduler has a ton of information about what all the workers are doing, all the tasks that are being run, the memory usage, network communications, you know, file descriptors that are open on every machine, et cetera. And in trying to benchmark and profile and debug a lot of Dask work, we found that we wanted access to all that information. And so we built this really nice dashboard. And so, yeah, it tells you all that

24:26

information. You saw one page at that demo, but there are like 10 or 20 different pages of different sets of information. So there's just a ton of state in the scheduler. And we wanted to expose all that state visually. This ends up being, it was originally designed for the developers, but it ends up being critical for users because understanding performance is really hard in the distributed system, right?

24:45

Our understanding of what is fast and slow in a single machine gets totally turned around when you start adding concurrent access to disk and network communications and everything. So yeah, it's a live dashboard. We used Bokeh. That's a Bokeh server application to build it. So it's got, you know, it's updating at something like a hundred milliseconds frame rates. So it looks live to the human eye. Yeah.

25:05

It's showing you progress. It's showing you every task that gets done and when it gets done and where it gets done. It's showing you memory use on all different machines. You can dive in. You can actually get like line by line profiling statistical information on all of your code. So that coordinate across machines. So like this function is taking this long, but actually it's summing up the profiling across all the cluster. So the scheduler is just aware of everything and the

25:27

workers are all gathering tons of information. There's tons of diagnostics on that machine. Yeah. So maybe a way to say this is that in order to make good decisions about where to schedule certain tasks, DAS needs to have a bunch of information about how long previous tasks have taken, the current status everywhere. And so we've gotten very, very good at collecting telemetry and keeping it

25:45

in a nice index form on the scheduler. And so now that we have all that information, we can just present to people visually. Might as well share. Yeah. I find myself sometimes using DASC just on sequential computations just because I want the dashboard. Like I don't want parallelism. I just want the dashboard. It's just the best profiler I have. It is a good insight into what's happening there. That's super cool. And I love that it's live. So

26:04

it seems really nice. I was surprised at how polished that part of it was. That is honestly just due to Bokeh. If people, if they like dashboards, go look at Bokeh server. I didn't know any JavaScript. I didn't know about Bokeh before. It was super easy to use. I integrated it in my application. It was clean. It was nice. And it looks polished, as you said. That's super cool. Now you did say that you can run DASC locally and there's a lot of advantages

26:26

to doing that as you already described, but it's super easy, right? Maybe just describe, like it's just a couple of lines to set up a little mini cluster running on your system, right? Yeah. So it's even easier than that, but yes. So if you do from DASC, so if you, first of all, you would pip install DASC or content install DASC. It's a Python library. It's pure Python. You would then, if you want the sort of the local cluster experience with the dashboard

26:49

and everything, you say from DASC, I would import client. And then you create a client, which with no arguments, it would create for you like a few workers on your local machine and a scheduler locally. And then all the rest of your DASC work would just run on that. You also don't need to do that. You can just import DASC and there's a thread pool waiting for you if you don't set anything up. So a lot of people, again, don't use the distributed part of DASC. They

27:10

just use DASC locally and DASC can operate just as a normal library. It spins up a thread pool, runs your computations there and you're done. So there's no actual setup you need to do. Right. Sorry. There was a great tweet about maybe a few weeks ago. Someone was using the X-Array library. X-Array is a popular library for array computing in like the geoscience world. Someone says like, yeah, someone was recommending X-Array and DASC on Twitter. And someone says,

27:32

yeah, I've heard about X-Array. Use it all the time. It's great. Never heard of DASC though. Don't know what you're talking about. And it was hilarious because X-Array uses DASC under the hood. The guy had just never realized that he was using DASC. So that's a real success point. Yeah. If you don't have to poke your head up and like make it really hard and obvious that this thing is part of it, right? That's great. DASC had disappeared into being just infrastructure.

27:52

Yeah. So another thing that I thought was a super cool feature about this is you were talking about some kind of working with a really big tree of data, right? And you start processing it and it's going okay, but it's going somewhat slowly and it's a live talk with not that much time. So you're like, all right, well, let's add some more workers to it. And while the computation is running, you add like 10 more workers or something. And it just on

28:19

your cool bokeh graph and everything, you just see it ramp up and take off. And so you can dynamically add and remove these workers to the cluster, right? Yeah, definitely. So DASC is all the things you'd expect from any modern distributed system. It handles resilience, handles dynamically adding things. You can deploy it on any common job

28:37

scheduler. So it does sort of all the things that you'd expect from something like Spark, you mentioned before, or TensorFlow or Flink or any of the sort of popular attribute systems today, but it does sort of adjust those things. So you can sort of sprinkle in that magical dust into existing projects. And that's really the benefit. Yeah. So we focus a lot on how do we add clusters efficiently or how do we add workers to a cluster dynamically?

28:57

Yeah. It seems really useful. Just as you're doing more work, you just throw it in. And I'm guessing these clusters can be long lived things and different jobs can come and go. And are they typically set up for like research groups or other people in longer term situations or are they like sprung up out of Kubernetes, like it burst into the cloud and then go away?

29:17

I'm going to say both actually. So what we'll often see is that the institution, which will have something like Kubernetes or Yarn or Slurm, and they'll give their users access to spin up these clusters. So an analyst will sit down at their desk, they'll open up a Jupyter notebook, they'll import DASC, use one of the DASC deployment solutions to ask for Kubernetes DASC cluster. Then like their own little scheduler and work rules will be created on that machine,

29:39

on that cluster. They'll do some work for an hour or so, and they'll go away. And that cluster will go away. What IT really likes about the sort of adding and removing of workers is that we can add and remove workers during their computation. So previously they would go into the cluster, they'd ask for, hey, I want 100 machines. They'd use those machines for like a minute while they load their data,

29:57

and they would stare at a plot for an hour. And so it's not so much the ability to add new workers, it is the ability to remove those workers when they're not being used. This gives like really good utilization. You can support a lot more people who have really bursty workloads. So a lot of data scientists, a lot of scientists, their analysis is very bursty. They do a bunch of work and they stare at their screen for half an hour. They do a bunch more work and they stare at the screen for half an

30:19

hour. And so that sort of dynamism is really helpful in those sorts of use cases. Yeah, I can imagine. That's super cool. Now we talked about the DAS data frame and DAS array and those being basically exact parallels of the NumPy and Pandas version. But another thing that you can do is you can work in a more general purpose way with this thing called a delayed and a compute, right? Which will let you take more arbitrary Python code and fan it out in this way, right?

30:48

Yeah. So let me first add a disclaimer. DAS data frame and DAS array are not exactly like NumPy and Pandas. They're just very close. Close enough, you'll be fine. Similar, okay. If I don't say that, people will yell at me afterwards. So yeah, so the creation of DAS delayed is actually really interesting. So when we first built DAS, we were aiming for something that was Spark-like. Like, oh, we're going to build this like very specific system to parallelize NumPy and

31:10

parallelize Pandas and that'll handle everything. Everyone was built on top of those things. That did not end up being the case. Some people definitely wanted parallel Pandas and parallel NumPy, but a lot of people said, well, yeah, like I use NumPy and Pandas, but I don't want a big NumPy and Pandas. I'm using something that's like really different, really custom, really bespoke. And they wanted something else as though we had built like a really fast car. And they said,

31:31

great, that's a beautiful car. I actually just want the engine because I'm building a rocket ship or I'm building a submarine or I'm building like a mechatronic cat or something. It turns out that like in the Python space, people build like really diverse things. Like Python developers are just on the forefront. They're doing new things. They're not solving the same classic business intelligence problem over and over again. They're building things that are new.

31:54

And so DAS delayed was like a really low-level API where they could build out their own task graphs, their own parallel computations, sort of one chunk at a time using normal Python for loops. So it allows you to build like really custom, really complex systems with normal looking Python code, but still get all the parallels in the task and provide. So you still get the dashboard, you still get the resilience. You can build out your own system in that way.

32:15

Yeah. You even have the graph and the dashboard, as you said, which is really cool. Yeah. I love it. So when I saw that, I was thinking, well, this is really interesting. And I see how that lets me solve a more general numerical problem. Could I solve other things with it, right? Like suppose I have some interesting website that needs better performance, right? I've got something that has really nothing at all to do with computation. Maybe it talks to databases or other things,

32:41

right? Could I put a DASC backend on like a high-performance website and get it to do, you know, bizarre cluster computations to go faster? Or is it really more focused on just data science? Yeah. So people would use DASC in that setting in the same way they might use like a multiprocessing pool or a thread pool, or Celery maybe, but they would scale it out across a cluster. So DASC satisfies like the concurrent futures interface, if you use concurrent futures,

33:07

and also does async away stuff. So people definitely integrate it into systems like what you're talking about. I think at UC Berkeley, there's a project called Cesium that does something like that. I gave a talk at PyGotham like a year or two ago that also talked about that. So yeah, like people definitely integrate DASC into more webby kinds of systems. It's not accelerating the web server. It's just accelerating the sort of computations that the web server is hosting.

33:29

It's useful because it has like millisecond latencies and scales out really nicely and is like very Pythonic. It fits all the Python APIs you'd expect. Yeah. And the fact that it implements async and await is super cool. If you use it with something like Sanic or Court or one of these async enabled web frameworks, it's just a line, a weight, some execution, and the scalability, you know, spreads maybe just some DASC Kubernetes cluster or

33:54

something. That's pretty awesome. You talked about the latency and I was pretty impressed with how those were working. You were quoting some really good times. And actually you said that's one of the reasons you created your own scheduler and you didn't just try to piggyback on Spark is because you needed extremely low latency because it's very, it's kind of chatty, right? If I'm going to take the dot product of a thing that's all over the place, they've got to coordinate quite a bit

34:16

back and forth, right? That can't be too bad. Yeah. So Spark actually internally has okay latencies today. It didn't at the time, but they're also at sort of the millisecond range where Spark falls down as in complexity. They don't handle sort of arbitrary task graphs. Sort of the other side of that is systems like Airflow, Luigi, and Celery, which can do more arbitrary task graphs. You can say, you know,

34:35

download this particular file. And if it works, then email this person, or if it doesn't work, parse this other thing. You can build these dependencies into Airflow, Luigi, Celery. Where those things fall down is that they don't have good inner task communication. So you can't easily say, I've got these two different pandas dataframes with different machines,

34:51

have them talk to each other. They don't handle that. They also have long latencies in sort of the like tens of hundreds of milliseconds, which can be... Right. Which is not that long, but when you amplify it across, you know, a thousand times, then all of a sudden it's huge. Yeah. Or when your pandas computations take 20 milliseconds or one millisecond,

35:07

and you want to do a million seconds, then suddenly latency does become a big issue. So we needed sort of the scalability and performance of Spark, but with the sort of flexibility of Airflow, Luigi, Celery. And that's sort of where Dask is sort of that mix between those two camps. This portion of Talk Python To Me is sponsored by Backlog from NewLab. Developers know the importance of organization and efficiency when it comes to collaborating on a team. And Backlog is the

35:34

perfect collaborative task management software for your team. With Backlog, you can create tasks, track bugs, make changes, give feedback, and have team conversations right next to your code. You track progress with features like Gantt and burndown charts. You document your processes right alongside your wikis. You can integrate with the tools you use every day like Slack, Jira, and Google Sheets. You can

35:54

automatically register issues from email or web form submissions. Take your work on the go using their top-rated mobile apps available on Android and iOS. Try Backlog for your team for free for 30 days using our special URL at talkpython.fm/backlog. That's talkpython.fm/backlog. I was really impressed with the low latency that you guys were able to achieve. And so maybe that's a good way to segment over to some of the internal implementations. You talked about using G-event,

36:23

Tornado. What are some of the internal pieces that are at work there? Yeah. So on the distributed system, we use Tornado for concurrency. This is because we're supporting both Python 2 and 3 at the time. Tornado is becoming more asyncIO friendly. That's sort of becoming more common. So Tornado for concurrency. Also Tornado a little bit for TCP communications, although we've had to improve Tornado's bandwidth for high-performance computers.

36:46

That's cool. So has Tornado gotten a little bit better because you guys have been pushing it to the limit? Yeah, definitely. That's good. Yeah, that's good. Dask touches a ton of the ecosystem. And so Dask developers tend to get pretty comfortable with lots of other projects. So Antoine Petroux used to work on Dask a lot. He did a lot of the networking stuff in Python, and he also worked on Tornado and did lots of things. So he was handling that.

37:06

So yeah, Tornado for concurrency and TCP communication. So all the Dask workers are TCP servers that connect to each other. So it's kind of like a peer-to-peer application. A bunch of just raw, basic Python data structures for our internal state, just because those are the fastest we can find. The Python dictionary is actually a pretty efficient data structure. So that's how we handle most of the internal logic. And then for computation, if you're using

37:29

Dask data frame, we're using pandas for computation. If you're using Dask array, we're using numpy for computation. So we didn't have to rebuild those things. And then, you know, compression libraries, encryption libraries, Kubernetes libraries. There's sort of a broad set of things we end up having to touch. I can imagine. Is the cluster doing like HTTPS or other types of encrypted communication and stuff like that between the workers and the supervisor?

37:50

Yeah. So not HTTPS. HTTP is sort of too slow for us. We tend to use TCP. Okay. Dask does support TLS out of the box, or SSL, you might know it as. But the comms are also pluggable. So we're actually looking at right now a high-performance networking library called UCX to do like InfiniBand and other stuff. Security is standard in that way, assuming you're happy with TLS. Yeah. Yeah. I suspect most people are. Until the quantum computers break it, and then we have a problem.

38:14

Sure. Comms are extensible. We can add something else. It's like a quantum state algorithm. Yeah. Yeah. I suspect we're going to be scrambling to solve the banking problem and e-commerce problem faster first, though. That'll be where the real fear and panic will be. But luckily, that's not here yet. So this just sounds like you guys are really pushing the limits of so much of these technologies, which sounds like it must be a lot of fun to work on. Yeah. It's been a wild ride. I can imagine.

38:41

Yeah. It's actually been interesting also building conventions and protocols among the community. I think Dask is at an interesting spot because we do talk to everybody. We talk to most scientific disciplines. We talk to banking. We talk to Fortune 500 companies. And we see that they all have the same problems. So I think recently I put together a bunch of people on file formats. Not because I care

39:04

deeply about file formats, because I had talked to all of them. We also talked to all the different library developers. And it's really interesting to see, you know, OK, Tornado 4.5. Are we ready to drop it or not, for example? And we talked to the Bokeh developers, the Jupyter developers, the DAS

39:16

developers, the Tornado developers. It's a well-integrated project into the ecosystem. I want to sort of like hang there for a moment and say that because, again, Dask was designed to affect an ecosystem, move an ecosystem forward, it actually has like its sort of tendrils into lots of different community groups. And that's been a really fun thing to see. Sort of not on a technological side, just on a social side. It's been very sort of satisfying and very interesting to see all these

39:39

different groups work together. Yeah, I can imagine you could have worked with not just the library developers on all the different parts of the ecosystem, but you're also on the other side of the scientists and the data scientists and people solving really cool problems, right? You'd be amazed how often like climate scientists and quantitative traders have exactly the same problems. You know, that's something I noticed doing professional training for 10 years. You know,

40:01

I taught a class at Edwards Air Force Base. I taught one at a hedge fund in New York and I taught one at a startup in San Francisco and like 80% that's the same. It's all the same, right? Like it's good computation. It's algorithmical thinking. It's like distributed systems. And then there's the special sauce that probably I don't even understand anyway, right? But it's super interesting to realize that how similar these different disciplines can be.

40:25

Yeah. And from an open source standpoint, like this is new because previously all those domains had like a very different technology stack. Now they're all using the same technology stack. And so as a person who sort of is paid to think about software, but also cares about advancing humanity, it's like a really nice opportunity, right? We can take all this money from these quant traders, build that technology and then apply that technology to genomics or something.

40:48

Right. Or climate science or something that like really needs to get it right quickly. That's sort of a nice opportunity that we have right now, especially in Python, I think that we can hire a bunch of people to do a bunch of good work that advances humanity. Also helps people rich, makes people rich playing the stock market, but you know, we can let that happen. Hey, yeah. If they're helping drive it forward, you know, it might be okay. Sure.

41:11

For the good, maybe that's debatable, but probably. It sounds like such a cool project. Now maybe let's talk a little bit more of the human side of it. Like obviously you work a lot on it, as you said, but who else? Like, are there a lot of people that maintain to ask or what's the story there? I've worked on it for a while within, so we started within Continuum or what is now called Anaconda. And there's a bunch of other engineers within Anaconda who also work on Dask.

41:34

They do sort of halftime something else and halftime work on Dask. Are they still? Definitely. People like Jim Criss, Tom Augsburger, Martin Durant, and others throughout the company. And so yeah, they'll maintain various parts of the project or push it forward. So like Jim recently has been working on making Dask easier to deploy on Hadoop or Spark systems using Yarn. And that's

41:51

something that is important for Anaconda's business and also just generally useful for people. So there's a sort of nice overlap between helping out the open source project and also getting paid to do the work. Yeah. Outside of Anaconda, so I'm now at NVIDIA, there's a growing group at NVIDIA who's also working on

42:05

Dask, which is pretty exciting. And also there's a bunch of other people who are working in other research labs or other companies who need to use Dask for their work and need to build out the Kubernetes connection or something like that. And they maintain that both because it's good for their company and also they just sort of enjoy doing that. So I would say in sort of the periphery of

42:23

Dask, most of the work is happening in other companies and other people. And there's hundreds of people who do work on Dask in any given month or two. There's probably a lot of people that make like one or two little contributions to like fix their edge case, but it all adds up, right? There's that. There's also maybe like 20 or 30 people who are maintaining, not just adding a fix, but like maintaining some Dask foo package. So there's like, there's a researcher, Alistair Miles,

42:48

who works in the Oxford area, who works on genomics processing. And he has like a Dask enabled genomics suite called Psyched Allel. So that's his job, right? So he's working on that day to day. There's, you know, the scikit-learn developers maintain the Dask scikit-learn connection through Joblib. So Olivier Grissell has commit rights over all of Dask. He's

43:07

the main scikit-learn contributor. So there's a bunch of sort of, as Dask expands out into the rest of the ecosystem, that it's sort of being adopted by many, many smaller groups who are in other institutions. So it sounds to me like there might be some decent opportunities for folks to contribute to Dask in the sense that, you know, maybe there's somebody that's doing some kind of research and their library or something like it would really benefit from having a Dask level version of that.

43:36

So if people are looking to contribute to open source and the data science space, are you guys looking for that kind of stuff? Yeah, definitely. I mean, you don't have to look these days. People are just doing it. What I would say is that there's probably, in whatever domain you're in, there's probably some group that's already thinking about how to use Dask in that space and jump onto that group. There's a lot of sort of like low-hanging fruit right now. Pick your favorite PyData library.

43:59

Think about how it could scale. And there's probably some small effort around that currently. And there's a lot of, you know, if you want to get into open source, it's a great place to be. You're like right on the very exciting edge of big data and scalability. Things are pretty easy today. We've solved a lot of the like the major issues by with sort of the core projects like NumPy and Pandas and scikit-learn. So yeah, there's a ton of ton of pretty exciting work that's like

44:21

just starting to pick up now. So yeah, if you're interested in open source, it's a great place to start. It seems like there's so many data science libraries that there's probably a lot of opportunity to like bring some of them into the fold. Pretty low-hanging fruit there. So let me ask you some kind of how wild is that type of thing. One is what's the biggest cluster you've ever seen it run on? The biggest I've had- Or heard about it running on?

44:45

Yeah. I would say like a thousand machines is probably the biggest that I've seen. And that's probably like with an order of magnitude of limits today. But those machines are big machines. They have like 30 cores on them. Yeah. Like that's also just mostly for benchmarking. Like most problems in data science aren't that big. Right. Thousands sort of. That's awesome. I was thinking of places like CERN that have their big grid computing stuff and things

45:08

like that. Like they could potentially spin up something pretty incredible. That gets used on a fair number of supercomputers today. But it's very, very rarely used on like the entire supercomputer. There's really no system outside of MPI that I would expect to run on the full like CERN federated grid. So most of the sort of data source oriented libraries end up sort of maxing out around a thousand machines. I mean, hypothetically you go higher, but the overheads will be enough to

45:36

kill you. Does it have a sort of a, like how many people's hands do you have to shake to shake everyone's hand in a group? You know, like that in factorial type of issue. As you add more, does it get like harder and harder to add the next worker to the cluster or is it okay? No, it's fine. Everything's linear in terms of number of workers. Workers are, it's not a, like a mesh grid. It's workers talk to the scheduler to find out who they have to talk

45:56

to. They'll then peer to peer talk to those people. But in any given computation, you only talk to like, you know, five to 10 people at once. Okay. It's fine. Yeah. Interesting. Okay. The other one is what's the coolest thing that you've seen computed or run on Dask or built with Dask? I'm going to say, there's things I can't, yeah, things I can't talk about, but. Some are, the most amazing ones are super secret.

46:16

I'm going to point people to the Pangio project. So this is actually my most recent Okay. PyCon talk in 2018, talking about other people's work. So Pangio is pretty cool. Pangio is a bunch of earth scientists who are doing climate change, meteorology, hydrology, figuring out the earth. And they have these, you know, 100 terabyte petabyte arrays that they're trying to work with. They haven't quite gotten to the petabyte scale yet.

46:35

And yeah, so they usually set up something like Jupyter hub on either a supercomputer or on the cloud and Kubernetes. They then analyze tens or hundreds of terabytes of data with X-ray or Dask around the hood and get out, you know, nice plots that show that indeed sea levels are rising. I think it's pretty cool in that it's good community of people. So there's like a bunch of different groups, NASA, universities, USGS, UK Met Office, anyone who's looking at

47:01

large amounts of data, companies, plant labs, et cetera. So it's a good community. They're like figuring out how to store one copy of the data in the cloud and then have everyone compute on that using cool technologies like Kubernetes. They're inventing cool technologies like how do we store multi-dimensional arrays in the cloud. So yeah, like that's maybe a good

47:19

effort. Not necessarily cool computation, like from a Dask perspective, it's actually pretty like vanilla, but it's cool to see a scientific domain advance in a significant way using some of these newer technologies like Jupyter hub and Dask and Kubernetes. And I like to see that repeat it. So I think we do the same thing with medical imaging and with genomics. So that's maybe like a cool example. It's not so much a technical situation, it's a community situation.

47:43

Yeah. Well, it sounds like a cool example that could be just replicated across verticals. Definitely. We're starting to see that, which is really exciting. Okay. Those are great. So when I was looking into Dask, I ran across some other libraries that are like, kind of like you described, DaskFu. So there was DaskCuda, DaskML, DaskKubernetes. Do you want to just talk a little briefly about some of those types of things? Sure. And why you might use them?

48:06

I'll go in reverse order. So DaskKubernetes is a library that helps you deploy Dask clusters on Kubernetes. So you pointed at like a Kubernetes cluster and you say, go forth and use that and it could just create the pods and run them in all that? Definitely. So it creates pods for all the workers. It manages those pods. If you ask it, there's like little buttons, you ask it to scale up or scale down. It'll manage all that stuff for you

48:27

using the Kubernetes API. It presents users with like a super simple interface. They can open up a Jupyter notebook and just sort of play around. So that's what that does. And it has peers like Dask Yarn for Hadoop Spark clusters and Dask JobQ for more high-performance computing schedulers like Slurm,

48:42

PBS, SGE, LSF. So that's what you ask for Kubernetes. That's like on the deployment side. DaskML is an effort to, there were a lot of sort of interesting ways of using Dask with machine learning algorithms, things like GLMs, Generalizing Your Models, XGBoost, scikit-learn. That were all different technical approaches. And DaskML is like an almost like a namespace library that collects all of them

49:03

together, but also using the sort of scikit-learn style APIs. So if you like scikit-learn and you want to scale that across a cluster, either with simple things like hyperparameter searches or with large data things like logistic regression or XGBoost, you should look at DaskML and it'll help you do that. Dask CUDA is a, that's like a new thing I've been playing with since moving to NVIDIA. People have been using Dask with GPUs for a long time and they've developed like a bunch of weird wonky

49:27

scripts to like set things up just the right way. There were a ton of these within NVIDIA when I showed up. It's just a few like nice convenient Python ways to set up Dask to run on either one machine with many GPUs or on a cluster of machines with lots of GPUs and how to make that easy. So it's kind of like Dask Kubernetes in that it's about deployment, but it's more focused on deploying around GPUs and there are some intricacies to doing that well.

49:51

I imagine you can scale like crazy if you put the right number of GPUs on a bunch of machines and let it go. No, GPUs are actually pretty game-changing. It's fun. Yeah. It's kind of near the limits of human understanding to think about how much computation per second happens on a GPU. I think CPUs are also well beyond the limits of human understanding. But this is like the galaxy versus the universe, you know, like... Sure. They're both outside of our understanding.

50:22

I should like mention that I work for NVIDIA now, so please don't trust anything I say. But yeah, GPUs are like fundamentally like a game changer in terms of computation. Like you can get a couple orders of magnitude performance increase out of the same power, which I mean, from like a science perspective, like has the opportunity to unlock some things that we couldn't do before. GPUs are also notoriously hard to program, which is why most scientists don't use them. And so that's sort of

50:44

part of why I work at NVIDIA today. They're building out a GPU data science stack and they're using NASK to paralyze around it. And so it's fun to see... It's fun both to sort of grow NASK within NVIDIA and have another large company behind it. But also from a science perspective, like I think we can do some good work to improve accessibility to different kinds of hardware like GPUs that might dramatically change the kinds of problems that we can solve, which is exciting in different ways.

51:08

Yeah, very incredible. And I would much rather see GPU cycles spent on that than on cryptocurrency. Sure. Although cryptocurrency is good for NVIDIA and all the GPU, the graphics card makers, they don't seem as useful as like solving science problems on them. All right. We're getting close to our time that we have left. So I'll ask you just a couple of quick questions before we wrap it up. So we talked about where NASK is. Like, where is it going? It

51:34

sounds like there's a lot of headroom for it to grow. And NVIDIA is investing in it by sort of this work that you all are doing. And it sounds like they're also building something amazing. So yeah, what's next? NVIDIA aside for a moment, I'll say that NASK technologically is sort of done. The core of it, there's plenty of bugs and plenty of features we can add, but we're sort of at like the incremental

51:52

advancement stage. I would say where I'm seeing a lot of change right now is in broadening it out to other domains. So I mentioned Panjio a little bit before. I think that's a good example of NASK spreading out and solving a particular domains problem. And we're seeing, I think, a lot more of that. So I think we'll see a lot more social growth and a lot more applications to new domains. That's really what I'm pretty excited about.

52:12

That's cool. So it's like NASK is basically ready, but there's all these other parts of the ecosystem that could now build on it and really amp up what they're doing. Yeah, let's go tackle those. Let's go look at genomics. Yeah, for sure. Let's go look at imaging. Let's go look at medicine. Let's go look at all of the parts of the world that need computation on a large scale, but that didn't obviously fit into the sort of Spark,

52:33

Hadoop, or TensorFlow regime. And NASK, because it's more general, can probably handle those things a lot better. Yeah, it's amazing. I guess, let me ask you one more NASK question before we wrap it up real quick. So NASK, is NASK 100% Python? The core part, yeah. Or near? Yes. You're probably also using NumPy and Pandas, which are not Python. Which, yeah, right, as we discussed, but yeah. NASK itself is pure Python.

52:54

I think it's really interesting because you think of this really high-performance computing sort of core, right? And it's not written in Go or C, but it's written in Python. And I just think that's an interesting observation, right? Python is actually not that bad when it comes to core data structure manipulation, tuples, list, dictionaries. It's maybe like a factor of two to five slower, not a hundred times

53:17

slower. It's also not that bad at networking. It does that decently well. Socially, the fact that Python had both a data science stack and a networking stack in the same language made it sort of the obvious right choice. That being said, I wouldn't be surprised if in a couple of years, we do rewrite the core bits in C++ or something, but we'll see. Sure. Eventually. When it's kind of finely baked and you're looking for that extra 20, 30% here or there, right?

53:38

Yeah, we'll see. We haven't yet reached the point where it's a huge bottleneck to most people, but like the climate science people do have petabyte arrays and sometimes a scheduler does become a bottleneck there. Right. It's cool. It's interesting. So final two questions. If you're going to write some Python code, what editor do you use? I use Vim, not for any particular reason. I just showed up the computer lab one day and said,

53:57

how do I use Linux? And the guy there was a Vim user. So now I'm also a Vim user. I think I haven't moved off just because I'm so often on other machines. So like rich editors just don't move over as easily. Yeah. I'm on the cluster. I need to edit something. It doesn't work to fire up PyCharm and do that so easily. Right. Yeah. So Vim for now, but not religiously, just anecdotally. Sure. Sounds good. Yeah. I'm sure you do some stuff with Jupyter Notebooks as well every now and then.

54:23

Sure. I'll maybe also put a plugin for JupyterLab. I think everyone should switch from the classic notebook to JupyterLab, especially if you're running Dask because you can get all those dashboards inside JupyterLab really nicely. It's a really nice environment. Oh yeah. That sounds awesome. All right. And then notable PyPI package. I'll go ahead and throw pip install Dask out there for you.

54:40

So I'll cheat a little bit here. I'm going to list a few. First, in a theme of the NumPy space, I really like how NumPy is expanding beyond just the NumPy implementation. So there's Kupy, which is a NumPy implementation on the GPU. There's Sparse, which is a NumPy implementation with Sparse arrays. And there's Aura, which is a new date-time D type for NumPy. I like that all of these were built on the NumPy API. We're built

55:04

outside of NumPy. I think the kind of heterogeneity is something we're going to move towards in the future. And so it's cool to see NumPy develop protocols and conventions that allow that kind of experimentation. And it's cool to see people picking that up and building things around it. So that's pretty exciting. Those are great picks. It's cool that I learned one API and now I can do other things, right? Like I know how to do the NumPy stuff, so now I can do it on a GPU.

55:26

Right. Or I've got my NumPy code. I want to have better date times. I can just add this new library and suddenly all my NumPy code works because that's sort of extensibility, which is something that we need, I think, going forward. Yeah, absolutely. All right, Matthew. Well, this was super interesting. Thank you for being on the show. You have a final call to action. People want to get involved with Dask. What do they do?

55:44

Try it out. You should go to examples.dask.org. And there's a launch binder link on that website. If you click on that link, you'll be taken to a JupyterLab notebook running in the cloud, which has everything set up for you, pretty much of examples. So you can try things out, see what interests you, and play around. So again, that's examples.dask.org. Yeah, and I also link to the Dask examples GitHub repo that you have on your account in the show notes. Yeah, so people can check that out too.

56:07

Awesome. All right. Well, thank you for being on the show. It was great to chat with you and keep up the good work. It looks like you're making a big difference. Great. Thank you so much for having me, Michael. Talk to you soon. Yep. Bye. This has been another episode of Talk Python To Me. Our guest on this episode was Matthew Rockland, and it's been brought to you by Linode and Backlog. Linode is your go-to hosting for whatever

56:27

you're building with Python. Get four months free at talkpython.fm/Linode. That's L-I-N-O-D-E. With Backlog, you can create tasks, track bugs, make changes, give feedback, and have team conversations right next to your code. Try Backlog for your team for free for 30 days using the special URL talkpython.fm/Backlog. Want to level up your Python? If you're just getting started,

56:52

try my Python Jumpstart by Building 10 Apps course. Or if you're looking for something more advanced, check out our new async course that digs into all the different types of async programming you can do Python. And of course, if you're interested in more than one of these, be sure to check out our everything bundle. It's like a subscription that never expires. Be sure to subscribe to the show.

57:12

Open your favorite podcatcher and search for Python. We should be right at the top. You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at slash RSS on talkpython.fm. This is your host, Michael Kennedy. Thanks so much for listening. I really appreciate it. Now get out there and write some Python code. Bye. Bye. Bye. I'm out. Bye.

Transcript source: Provided by creator in RSS feed: download file

Episode description

Transcript