The 3 traps of open source funding models | Wes McKinney (pandas, Voltron Data, Posit)

Jun 25, 2024 | 1 hr 9 min | Ep. 42

Episode description

From creating one of Python’s most influential libraries to co-founding Voltron Data, Wes joins the show to chat about why the cover of the pandas book doesn’t feature a panda, open source pitfalls to avoid, the pros and cons of hiring engineers at a non-profit, and more.

 

Segments:

(00:02:50) Guang’s complaint about the pandas book cover

(00:04:38) Quarto and Open Access Publishing

(00:12:00) Convincing Wall Street to Open Source

(00:15:31) Publishing the first Python package over Christmas

(00:18:01) Doubling Down on Building pandas

(00:23:23) Personal sacrifices for the sake of impact

(00:26:28) The Evolution of Open-Source

(00:29:19) “Open source development started out as a very privileged activity”

(00:32:40) The Consulting Trap

(00:35:17) The Startup Trap

(00:39:29) The Corporate User Trap

(00:44:21) Avoiding the Startup Trap

(00:46:54) Non-Profit vs. For-Profit

(00:48:09) The Challenges of Hiring Engineers in a Non-Profit Setting

(00:50:08) The Benefits of Remote Work for Open Source Development

(00:52:15) Balancing Open Source and Enterprise Interests

(00:57:25) New Funding Models for Open Source?

(01:00:01) Getting into VC

(01:06:19) The Future of Composable Data Systems

 

Show Notes:

- online edition of pandas book: https://wesmckinney.com/book/

- the new digital publishing tool that Wes recommends: https://quarto.org/

 

Stay in touch:

👋 Make Ronak’s day by leaving us a review and let us know who we should talk to next! [email protected]

 

Music: Vlad Gluschenko — Forest License: Creative Commons Attribution 3.0 Unported: https://creativecommons.org/licenses/by/3.0/deed.en

Transcript

Creating open source software, it's very difficult, and for me it's been very emotionally draining, because you have to soldier through the dark days of the project where there's not that many people that care and you have a conviction and a belief that what you're doing is important and is going to have impact. But that impact is going to be realized far into the future. Work that you're doing today, you're not going to see the impact of that or feel recognition or see the value of that work for at least six months.

And so it's very deferred gratification.

I think this goes back to making open source your vocation, that it should be a full-time job if you're doing open source. Going back to the second trap you talked about, which is the startup trap. Can you tell us more about that?

Yeah, the startup trap is where you create a company, you raise some venture capital and you build a product that is either an explicit commercialization of the open source project or you build a sort of vertical solution that's powered by the open source project. And so there's a couple of issues that can happen here.

Welcome to the Software Misadventures podcast. We are your hosts, Ronak and Guang. As engineers, we are interested in not just the technologies but the people and the stories behind them. So on this show, we try to scratch our own itch by sitting down with engineers, founders and investors to chat about their path, lessons they've learned, and of course the misadventures along the way.

Sweet. Yeah, Wes, thanks so much for joining us on the show.

Thanks. Thanks for having me.

So as the creator of pandas, you wrote the book Python for Data Analysis back in 2012. I really liked it when I was learning data engineering, back in 2015 this was. So thank you for that. But I would like to file a complaint about the cover of the book. So for context, the book was published by O'Reilly, and O'Reilly books all feature these different animals. Can I just say how sad I am that the featured animal was not a panda but...

It wasn't, it was like a weasel, like, what kind of animal was it?

Yeah, it's funny. When I was working on the book... as an O'Reilly author, you don't get to choose the animal on the cover. So, you know, I suggested it would be cool to have a panda on the cover. I think what they said was, oh, we're saving the panda for something really big.

What?

Which is kind of, well, it's funny, because I think that the book ended up being way more successful than anybody expected. When I originally got the book contract with O'Reilly, it was in like November 2011, and it was definitely experimental at the time. I don't think anyone had a really clear idea whether Python was going to become a big deal in mainstream data analysis, or what we now call data science.

So I think the fact that the book has been so successful and, you know, it's been translated into like 10 languages and has sold, I don't even have the full count, but my guess is like 300,000 copies, kind of ballpark, maybe more when you account for all of the subsidiary translations and things like that.

But I think that's because it's become a reference textbook for a lot of university courses, and so that creates consistent demand for the book around the globe. And it's funny. Sometimes I get emails from people who live in a country under sanctions. For example, I've gotten emails from people in Iran, and they say, I pirated your book.

I'm sorry, is there some way that I can pay you? And it's like, they literally could not buy the book because of sanctions. But now with the third edition, the content's freely available online. And given the book is now 12 years old, I've tried to update it and keep it relevant and keep up with the changes that happen in pandas.

It reminds me that I have a pending queue of errata to fix, and to get a new printing out. Basically the way it works is you make a new version, which contains the major edits to the book, and then over time you fix little things and they'll update things at the printers. So people get little bugs fixed in the book, like patch releases, I would call them.

Was it hard to convince them to, hey, we're going to just have a free online version for the last edition?

It was somewhat tricky. I think the fact that there was precedent for open access books helped; R for Data Science is one example. So when Hadley and his co-author wrote R for Data Science, they had the book available for free as it was being written, and they had the stipulation in their contract that we will only do this with O'Reilly if we're able to release it as open access.

And I think that helped show that having the open access version actually doesn't hurt print sales as much as you might think. I actually expected that having it available for free would reduce print sales, but to my surprise, the print sales have been pretty stable. Or maybe they were hurt a little bit and the market got bigger because there's more and more people doing Python.

But I got permission to release the book in one and only one open access format. So I got to pick whether it would be Jupyter notebooks on GitHub or a website or whatnot. And so I chose the website, because I thought that having the SEO and the ability to go to wesmckinney.com/book and search the whole book, you know, and get instant results was a pretty useful feature.

And, you know, JJ Allaire helped me port the book to Quarto, which is an awesome new technical publishing system that I've been recommending to everyone. And so part of the reason why the book looks so nice and is so easy to browse and search on the website is because of Quarto.

Very cool. Going back to Quarto, when you say it's a new digital publishing tool, can you say more about this? I'm just curious what this is. I've never heard of it before.

Yeah. So if you go to quarto.org, it's a language-independent technical publishing system. Under the hood, at the core of Quarto, you have Pandoc, which handles translation between different document formats.

But Quarto has become a pretty big software project that handles creating books and blogs and websites. You can basically write a book using Jupyter notebooks and then use Quarto to stitch the notebooks together to create a book-like structure.

Quarto handles all the orchestration of rendering the Jupyter notebooks and converting the output of the Jupyter notebooks into the necessary output format given your book publisher. So for example, O'Reilly Media uses AsciiDoc and DocBook XML as their input formats for publishing.

And so Quarto knows how to go from a Jupyter notebook with various tags and Markdown cells and code and everything, and you can add special annotations within your Jupyter notebook to handle particular things that have to do with O'Reilly's toolchain. But my book was written in DocBook XML in 2011, 2012. And so I actually got really good at writing XML, and I have all these Emacs shortcuts for generating XML tags for DocBook XML.

It's not something that I would recommend to everyone, but it's something that I was just forced by necessity to get good at. Basically what JJ helped me do with Quarto is write Pandoc filters, which are written in Lua, to convert the book from DocBook XML into Quarto Markdown, which is Markdown plus some extensions and customizations for Quarto.

And then I can use Quarto to render the book to a PDF or to a website or, in principle, any output format. So the history with Quarto is that JJ and his collaborators have created multiple other dynamic web publishing systems.

So there was ColdFusion in the 1990s, which was one of the original dynamic web page frameworks, along with CGI and PHP. And then he and his company created what ultimately became Windows Live Writer. And then they created R Markdown early on at RStudio, which turned into Posit.

R Markdown was basically a technical document publishing framework where you could write Markdown interlaced with R code, and it would handle all the rendering and output to different formats. So Quarto is kind of a reimagining of all those things, built on a very modern foundation.

And it generates, like, a portable binary. It ships a whole JavaScript runtime; it uses Deno, which is kind of the fancy Rust-based Node.js runtime. But it's very easy to deploy. And yeah, I think it's a really cool project. And so I've also been encouraging a lot of open source projects to migrate their project documentation and websites to use Quarto, because we did that for the Ibis project, for example, and it generated really good results.

And that's what I was thinking, actually. I was just navigating the book on the website. It looks really good and it's super easy to navigate. I was thinking that many teams could use something like this for internal documentation, for example, to make it look this nice.

Yeah, so you can think about creating internal websites using Quarto. And actually one of Posit's enterprise products is called Connect, which is basically a secure publishing system for internal publication. So documents, Jupyter notebooks, really anything you could create with Python or R with Quarto can be published dynamically to Connect, and you can set up fine-grained permissioning.

So imagine you had some Quarto document or some set of documentation that you only want to be visible to one team inside your company, and you want to set it up to deploy from a GitHub repository, something like that. That's something you can do with Connect. So it's all interconnected. But if you want to use Quarto to generate a Confluence page and put it in Confluence, if you're an Atlassian customer, that's something you can do also.

Super cool.

Cool. So, sorry, coming back.

Yeah, bit of a tangent. In summary, I'm a huge fan.

No, definitely. We'll link it in the show notes for folks to check it out. So a couple of years back, you wrote this post announcing Ursa Labs, in which you mentioned these three traps for people working in open source in terms of how they get funding.

I thought this was really cool because it kind of ties different parts of your career together. I like how you're like, yeah, I have directly experienced all these problems. I'm a big believer in experiential learning. I think that's the only way to really get an understanding of problems.

So I thought that we can go into these different traps and go through them together. So the first one is the consulting trap, and I think that kind of ties back to pandas. So to get us started, this is early on in your career after college, so you're working in finance at a hedge fund. And that's where you started building pandas and eventually made it public.

And then shortly after that, you actually decided to pursue a PhD in stats at Duke. So you mentioned this yourself, that financial institutions are not really charitable to open source. We're both very curious: how did you manage to convince them to open source it?

It wasn't easy to convince them. I will say that in the last 15, I guess 17, years since I first got involved in working in finance in the mid 2000s, financial firms have increasingly seen the value of making things open source. And so not only AQR, where I worked, but Two Sigma, Bloomberg, Jane Street, these companies have released a lot of open source software. But to get companies like this, that value their intellectual property so highly, to dip their toes in open source was not easy.

I think at the time it took maybe six months of discussions and convincing. And ultimately I made the argument that yes, we'd be giving away potentially some secret sauce that would help the company's competitors work with data more easily. But I also talked about the likelihood that Python would become more widely used. I think at the time D.E. Shaw, for example, had begun to use Python for certain things.

And so it's a little bit of a cost-benefit. If you release a piece of open source software, you have a better chance of your thing becoming the main thing. And that creates a lot of network effects and value within the open source ecosystem. But if you don't open source, and then somebody else open sources their thing and that becomes popular, then you're sort of on an island.

And so it's like building bridges and doing trade with your neighbors versus having a very isolationist mindset. There's definitely pros and cons to the different approaches. If you create something really valuable, maybe you want to hoard that invention and use it to your maximum benefit. But there's also downsides.

And I think it helped that I was very keen to engage with the open source community. And so I made the argument that I would use early pandas as a tool to better engage with the open source community and to recruit people to come work at the company. Maybe if the project became popular, then people would learn how to use it and they would want to come work at the company to have a job where they could use pandas.

And thankfully, I think all of that has basically come true. And so now AQR can hire new college grads, and they show up on their first day and they know how to use pandas and Python. And they can be productive working for the company pretty much right away, which is very different from the old way that many financial firms used to operate, which is that they had these very proprietary tool sets, proprietary data analysis tools and systems.

And so new employees would face a pretty significant learning curve to get up and running. And there were a lot of debates about licensing. I think some of the lawyers wanted to use the GPL, and of course the Python ecosystem is not very GPL friendly. If you put GPL on something, a lot of Python users, almost as a matter of principle, are not going to touch the library because they're concerned about the viral effect of the GPL.

Eventually, we agreed on using the new three-clause BSD license and putting it out there. But I think I initially started having the conversation about open sourcing it sometime in early to mid 2009, and I was only able to really push it through at the end of the year. I think the first pandas, 0.1, was released on New Year's Eve 2009.

So I was ignoring my family or friends, or I can't remember where I was, to get it up on PyPI. I'll tell you, the anxiety of publishing my very first Python package was pretty intense. It got easier after that. But yeah, the first time was hard.

What was that like? So this was literally New Year's Eve, and then you're just pushing it, or how did that go?

Yeah, I mean, if you look at, you know, PyPI... Oh boy, there's so many projects that are not pandas on PyPI. Let's see here, pandas three. My goodness. I swear the Python Package Index is full of malware. But yeah, if you go all the way back... no, I had it wrong, it was released on Christmas.

Christmas 2009, which is even worse in the sense of neglecting my family. I don't remember where I was at the time, but, you know, I had time over the holidays. And given that the open source side of pandas started out as my side project and side interest, as long as it continued to be maintained and work well for my job, that was enough.

Nice, nice. And I mean, that's a lot of work to have gone through, right? Six months of discussions and, you know, meetings, and then really pushing for it. I think a lot of people would have given up, especially since at that point it wasn't clear that this was going to have the kind of impact that it does today. What gave you the conviction that, you know, this is worth pushing for?

Yeah, I'm trying to place myself back in that mindset. It's been 15 years, but yeah, I thought there was a lot of potential in Python, clearly. And so I wanted to pull on that thread and see where it went. Because the stuff that we had created on top of pandas, and the whole set of research tools that we created in Python, were so much better than what I was using prior to that. So I felt like there was this potential to have a really transformative effect on people's productivity, or just making data analysis and data science a lot more accessible, and making it open source.

And so I clearly had a strong conviction, and yeah, it was something that I really wanted to do and see about. But I would say that it languished for a little while. Maybe "languished" is putting it strongly, but for about a year, because I got busy doing some other things. I applied to grad school. I started a PhD. And it was only when I started getting contacted by other companies who wanted my advice on switching to Python from other things that I realized that this was the time, and it's now or never, and I need to spend all my time on this to help the ecosystem develop into something that people can adopt and be successful using. I mean, it wasn't just pandas. There were a lot of other things that needed to fall into place to make it all happen, but pandas was an important part of the solution.

I see. That must be cool, to get that validation, right? Even after all this time, to have the inbound interest.

Yeah. I mean, my kind of self-deprecating way of looking at it is that, you know, I was in the right place at the right time. But it's certainly more than being in the right place at the right time. I had to take a lot of actions in order to make it happen. I had to make personal sacrifices. I sacrificed my personal life; I took time away from friends and family to work on it. I made significant career diversions because I believed that it would create more interesting opportunities for me in the future.

I could have continued to work in finance and had a very comfortable and lucrative career working in quant finance. Some of the people that I worked with in those years are still involved with the same companies that I collaborated with in those days.

So I could have stayed on that path but I I chose to I chose to take a risk and and put a lot of energy into it basically sweat equity I suppose and I but I was in a very fortunate situation I had no I had no student loans which is like I think I'm under under appreciated benefit and that I was able to take a risk and I wasn't I wasn't digging myself into that much of a financial hole I had some savings from like I lived very frugally in my first few years of working and had maybe like you know I was going to be a little bit more careful.

I was going to be like you know a year's worth of living expenses saved up and so I was when I told myself at the time was okay I'm going to work on this full I'm going to work on this full time I'll find like a little bit of consulting work on the side to help pay the bills but don't do too much consulting that I'm not able to spend most of my energy improving pandas and then after a year so I can see where I find myself whether you like whether this makes sense or like whether I'm getting the kind of return on return on my time like return on investment that the justifies continuing to do this.

Right, right. By the way, I like that reflection. I love the post that you wrote when you turned 30, just to reflect on things. One sentence that stood out to me was, right, you talked about how at MIT it was more about being smart, and then in New York it was about being wealthy, and then also in San Francisco. We had a very similar kind of conversation with Joshua about that.

And I thought it was quite cool that you're like, you know, given all that, let me try to figure out what do I actually want, what makes me happy. I thought that was very impactful.

Yeah, I think, you know, going through all these different professional situations and deciding how to spend my time and what to work on, there is an underlying search for meaning, a search for what actually matters to you. Do you value recognition or fame? Do you value money? Do you value comfort? What are the underlying motivations, the things that will make you feel satisfied and be happy with your life?

And, you know, I think in retrospect I went a little too far at times and made some significant personal sacrifices. I will say, in my 20s my personal relationships suffered.

I have a bit of an obsessive personality, and anyone who knows me well is familiar with that side of me. Like, oh, Wes has his projects, and sometimes the projects become an obsessive focus. And so, learning to find some balance and the importance of relationships and friendships and things like that, I think it was good that I went through all that. I think it was very helpful personal growth.

But I've learned about myself that I'm very motivated by impact, and to be able to have impact in a sustainable way I also have to take care of myself. I have to be a happy and resilient person. If I'm depressed all the time, I can't bring myself to care enough to start a new project or to drive projects forward through the tough times. Because creating open source software is very difficult, and for me it's been very emotionally draining, because you have to soldier through the dark days of the project where there's not that many people that care and you have a conviction and a belief that what you're doing is important and is going to have impact.

But that impact is going to be realized far into the future. Work that you're doing today, you're not going to see the impact of that or feel recognition or see the value of that work for at least six months, probably even more than that.

And so it's very deferred gratification. You have to tell yourself, okay, this is tough. Like, gosh, the build keeps breaking, and oh, the release, this Windows build. And there was a dark time when building stuff on Windows was really hard, and so every time I would fire up VirtualBox to build Windows binaries I'd be like, this really sucks, why must I go through this misery?

And often it's silent suffering, because, you know, you can always tell someone, oh man, it sucks that I had to spend four hours fixing the Windows build and getting these binaries out so I could release, and people often remind me, oh, you chose this life. If you wanted to be more comfortable, or to not be all on your own building, or just feel chronically understaffed working on these projects and making them happen, you know, this was a choice. And I guess it helps to remind yourself that it's always a choice. And yeah, if you're not ultimately happy with what you're doing... there's going to be good days and bad days, but hopefully you have more good days than bad days. I don't know, I think it was Steve Jobs who said if you have a certain number of bad days in a row, or, you know, it seems like you're not getting any positive feedback, then you need to make changes.

Just to follow up, do you think obsession is an important ingredient to push projects like these, where, like you said, it is so hard to have the conviction?

It can be. I mean, for me that's just my personality, and so I don't know that it's an essential ingredient.

It worked for me. I think that an obsessive personality can also lead to unhealthy behavior. Like, earlier in my life I got involved in video game speedrunning, and so we were playing the game GoldenEye 007. And that is a special kind of obsession, to play the same stretch of a video game hundreds of times in a row to try to get the fastest time and perfect all the little details in order to set a record or break your previous personal best. And so I think I fell into those kinds of obsessive patterns, patterns of self-improvement and efficiency. I've been that way since I was a child.

So it's not something that I would recommend to everyone, and I don't think it's the only way to do open source successfully, particularly now that open source has become a fixture and a strategy for businesses. So I think the model of the obsessive lone wolf hacker working on their nights and weekends to build a project is more or less going by the wayside.

And I think it's also become harder and harder for individuals to mount successful efforts, because we've solved a lot of the easy problems. In many cases it was like, okay, we just need an open source solution, and an individual can scrap together an open source solution to this problem relatively straightforwardly and in a reasonable amount of time. But what if you have a problem that is much more difficult, that needs 50 person-years of effort or 100 person-years of effort?

So an individual can't possibly do 100, even if they are 10 times more productive than the next person, or they overwork, or they work 80 hours a week or 100 hours a week. Maybe they can muster in one year the same amount of work output that somebody else might do in three or four years. But you want to deliver results on the order of single-digit years rather than, you know, 100 years or 25 years or something like that.

I think as the problems have become more difficult, it's required a different approach, and explicitly rejecting the lone wolf mindset, which was a feature of the early days of pandas. But I think there's fewer and fewer projects like that. That being said, you know, we have Polars in Python, which has been a lone wolf project from Ritchie Vink until, you know, until recently. He founded a company and is now hiring people to help him.

So we still do see successful scenarios like that, but it would be disappointing to me if the only way to be successful in open source were to engage in this objectively unhealthy behavior. And like I said earlier, I definitely made a lot of decisions and worked at the expense of my mental and physical health in my 20s, and so I've had to make a mindful choice to reject that and to not continue to do that myself. Also, I'm getting older and, you know, I can't work long hours like I used to, and I need sleep, and I have other things that I like doing in life. So anyway, balance is a good thing, and to be able to build important open source projects while also having balance in your life, I think, is something we're striving for.

So I want to ask this question, and this is a recurring theme I've seen. Aspects of what you said I relate to, in terms of sometimes being so narrowly focused on one problem that you neglect everything else, at the cost of your personal life at times.

And then with many folks we've spoken to on the podcast this theme comes up: early in the career, yes, super driven, super focused on this one problem, made a lot of progress, but then that also resulted in self-awareness, like, hey, this is not really sustainable. But that surge in the initial period does result in impact, recognition, or even, I would say, future opportunities that you weren't thinking about at the time. At that time you just wanted to get this thing to work.

So when this aspect of balance comes in, I've seen this advice almost consistently: make sure you have that balance, so that you have some extra energy to do other projects, or you're doing well and your personal life is good. But for people who are starting out, would you say it's okay to have that narrow focus, to have an imbalance in life? Like, hey, that's okay if you don't have, let's say for example, student loans to worry about or a family you're responsible for. Yeah, maybe it's okay, go crazy.

I mean, yeah, it's important to point out that things that I did when I was 25 I think wouldn't be practical for a lot of people. They have a family to support, or maybe they have student loans to pay. They have other obligations in their life that make it hard for them to work from 7 p.m. to 1 a.m. every day. And if you have a demanding job, then spending time on your nights and weekends... maybe you need to work a second job to make ends meet. And so I think fundamentally, in the early story of open source software... I think part of the reason that the open source world has significant inclusivity and diversity issues is indeed because open source development is fundamentally a privileged activity, or started out as a very privileged activity.

and so I think what's great now is that large companies have and startups and large companies have made open source an essential part of their strategy Microsoft from the Steve Ballmer days has transformed itself into being a very open source friendly company and Guido van Rossum like works at Microsoft working on making CPython faster and Microsoft has made enormous contributions to the open source world and out of like the major tech companies like the Magnificent 7 I would say that Microsoft is being able to make a lot of money and

that Microsoft is probably the best place to go and get be able to work on open source software for a living and so that that means that to take the software development yes it's you're giving away software building software and giving away for free for free on the internet but also it allows people to be able to have more balanced lives to treat it as a job rather than

like something that's coming at the expense of like your friends and family and like your life outside of your day job and and so I think that's it's essential and I yeah I think that it would be better for the volunteer model of open source to more or less go away because it's not very sustainable it leads to significant

maintenance problems maintainer burnout is particularly when somebody's working on a project outside of a day job or some other responsibilities that they have and so it's common that you see maintainers volunteer maintainers burn out so one of the solutions to maintainer burnout is for people to do open source as their job

and yeah so I think Linus Torvalds, to work on the Linux kernel, just worked on the Linux kernel as his full time job for a long time and so yeah I think yeah I recognize like I did the lone wolf thing like I did a lot of volunteer I early days eventually I've I arranged to get paid to work on open source and so that's made things I've been continuously paid to work on open source projects in you know in the last you know eight or nine years but but that was partly a reaction to

like the open source model is like this is going to cause me to be burned out and miserable and like I need to make this my vocation like my profession and so like I've given a lot of talks and have written a lot about how it is important for open source to become like a true vocation like a job and not something that's like this privileged activity that people do on their free time. Yeah so great lead into so the first trap, the consulting trap, can you tell us more about that.

Yeah so the consulting trap is where you have an open source project and you find consulting gigs or consulting projects where you work for a company that's using the open source project and maybe they partly are paying you to fix bugs and customize the project for their needs but what can happen is that you end up spending a lot more time working on the company's internal software projects.

You become more or less a software developer of that company and your work on the open source project can become incidental or something that you do on the side or ideally you would spend 50% of your time doing working on building custom software building things for the company 50% of the time on the open source project or even more time on the open source project but

it's not uncommon to see the shift and it being 10 20% of your time on the open source project and 80 90% of your time building building custom solutions for the client. I've seen that happen happened a number of times and so it's yeah there's good situations and bad situations I've seen very productive very productive open source consulting type relationships.

I think it's gotten easier as time has gone on but I think I think nowadays when a company engages a consultant who is an open source maintainer they understand that that partly what they're doing is paying this person to work on the open source project because maintaining it is good for them as well but but it's still it's still a risk and I think it's a trap in the sense that some fraction of the time you end up being kind of a substitute you know more or less fungible employee working within that company and that the work on the open source project is something that's on the back burner.

Like ways to avoid that trap as someone that's getting started doing that would that be just being very clear about setting time boundaries and how you should allocate your time at the contract. Yeah I think it's just being clear about the expectations and the contract and the statement of work and yeah setting clear boundaries I think yeah sometimes yeah if people go into the contract with yeah just the kind of ex if it's sort of hand wavy like yes yes like

improve the improve the open source project keep fix bugs and things like that it's easy to underestimate how much time that how much time that really takes and so yeah so just yeah I think setting those boundaries or the expectations that say if it's your goal to spend 50% of your time on the project on a steady state that you have that to carve out and you protect that time.

Yeah I think it goes back like I'm making open source your vocation that should be a full time job if you're on a open source going back to the second trap that you talked about which is the startup trap can you tell us more about that.

Yeah the startup trap is where you create a company you raise some venture capital and you build a product that is either an explicit commercialization of the open source project or you build some kind of a vertical solution that's powered by the open source project. And so there's a couple of issues that can happen here so so one issue is where you create a conflict between the needs and the business needs of the startup and the open source project and its user base.

And so that would take the form of I've seen any number of things from license changes to holding back features like basically maintaining a private fork of the project and reserving like pro features or features that you don't want to release to the open source project because it will it might undermine your edge in your business.

There can also be governance challenges because you as a startup you want to be able to move fast you don't if your goal is to create a healthy relationship with the contributors that are outside of the company. It does create an implicit negotiation with contributors that are not your colleagues and so what can sometimes happen is that the company will become like a you know pejorative term would be like a backroom cabal.

So they communicate in private they decide to make changes and then they push through and they push through changes in the project without getting the buy in and convincing convincing the other maintainers and so so the other contributors might feel demotivated because they feel like second class citizens if they're not working at the startup that that is commercializing the open source project.

Another thing that can happen that is also very common is that the investors in the startup can can take operational control of the company as a result of firing the CEO or company losing the founders losing board control and that that may lead to a shift of shifting of budget and more or less like a you know developers being laid off or reallocated to work on other parts of the company that are you know deemed to be more important.

To be more in line with generating a return on investment for return on investment for the investors and so so sometimes you can see like okay there's the company is really engaged in this project and then at some point there is a shift of there's a leadership change or there's some other shift in the company status and then the developers just disappear and it's like well I my boss says I have to work on something else and so suddenly like you no longer getting paid effectively to work on the open source project.

So suddenly getting like defunded to work that can definitely happen and relatedly I mean projects can also be dependent on development infrastructure provided by a company and so that that can create another source of risk that if that suddenly suddenly disappears then yeah so anyway we've seen like we've seen all these things and this is one of the one of the issues that that causes communities to fork like if they don't if they you know if they like this like a fork like this happened with.

Presto, with Presto, like the SQL engine, so so there was the the fork to PrestoDB and Trino and this was wasn't a startup issue per se but it was provoked by by my understanding was provoked in part by a governance conflict between Meta, Facebook, and the open source community of developers working on the project who did not work at Facebook at the time.

Yeah that was interesting to see by the way to actually see Presto being forked to Trino I read the post I think at the five year anniversary for Trino they wrote about some of this historical context and how Trino came to be and this was one of the things they highlighted there like these were the reasons for actually doing a fork and if you look at things right now at least I know at LinkedIn we used to use Presto very heavily but since this fork or the last I want to say at least three plus years we have been mostly using Trino I shouldn't say completely but most of it like a lot.

Bigger part of it than Presto like it is moving there and you see that community shifting over to Trino as well I was following that space for a while so I saw some of this shift at the time okay the third trap the corporate trap, corporate user trap rather, can you tell us more about that yes like the big company trap that is similar I think there's similar stories to to the startup trap I think in the corporate user trap I think.

There I think what you see there is that that's it's easier for developers to get to shift around or get moved off of a project and so developers shift in and out of working on like I was just looking at some component in Microsoft open source project and there was a developer who just left Microsoft and so essentially this did disappear from the project and so I guess this could happen with developers a developer working at a startup that's working on open source project but particularly in big companies priorities and budgets can change on a.

Quarterly to annual basis and so this can and some companies are notorious for for their priorities you know shifting or being somewhat flippant so especially in the center when I'm sure yeah and so whenever a project becomes too dependent on the generosity of a particular big company that can also become a source of risk because you're dependent on having the support of a particular

vice president or senior vice president who believes that the project is important something important for the company to to be maintaining and contributing to but that that could change based on the.

The system of the company and its quarterly performance and things like that so and yeah and then you also see some of the governance conflicts where decisions are how decisions are getting made like there's product managers involved and like other corporate apparatus and so yeah it's again open like big corporate open source can be done well I mean look I think Microsoft has done an outstanding job but I think we've seen plenty of scenarios where where things have gone things

have gone the other way and I mean look at I think if you look at the like MySQL, MariaDB, there was there was a community fork in part because of yeah because of bristling or challenges working with working with Oracle I think right

Yeah, yeah. And so it's it's a very common story and in particular when an open source project is part of some product line or is related to some you know is related to some profits and of the business ultimately corporations in most cases have an obligation to their shareholders and so yeah that comes in that can easily come in conflict with the with the needs of the open source community.

I think over time as you mentioned like this idea of having lone wolves working on an open source project is changing with a bunch of companies doing open source and in many cases I think the successful open source projects you see are not the ones which have only one company behind it the ones which have multiple big companies behind it because not one company will have dominance or

won't be able to govern the entire project themselves it becomes more of a community thing and you're not dependent on only one company at that point and this is something we see very commonly in many of the cloud offerings that companies build on top of so like the open source products that companies build cloud offerings on top of

where multiple companies are incentivized to improve that offering for example and that essentially pipe or translates to some of the things they're offering as a cloud but it's not necessarily true everywhere like I don't know if you saw this recently in the xz compression library there was this backdoor injected where this person did like social engineering for what I don't know three years or something like that I might be

missing that yeah maybe two years yeah yeah I think the xz, liblzma thing that was you know the levels of investigation it must have been a must have been a state actor like a whole whole shop of black hat security hackers creating obfuscated backdoors into you know some important kind of component in the Linux

and the Linux supply chain I think one thing I yeah I we I guess we didn't really mention this the at times predatory relationship between the major cloud vendors and open source projects and that's precipitated license changes and like the anti-AWS licenses like source available licenses like you may do anything you want with this open source project except operate a

operate a cloud service in a company with more than ten billion dollars a year in revenue and so there's only like a handful of companies that was there started with a yeah yeah yeah so I think this has like a the corporate part kind of has like a cool tie into kind of your decision of leaving and joining

Posit but before getting into that I want to kind of rewind a little bit to the the startup trap and so you founded or what led to Voltron Data given sort of these challenges like how did you deal with it when you're starting a company yeah so we worked on so we created Ursa Labs in 2018 which was a not for profit development group was funded by RStudio, which is now Posit, and Two Sigma and NVIDIA and like some other financial

firms like Bloomberg so I wanted to create like a nonprofit industry consortium to fund Arrow development and that was going great for a couple of years and we were seeing significant demand to put a lot more firepower into the Arrow ecosystem and companies that were interested in having support like a formal relationship development

relationship with a company behind the Arrow ecosystem and so it was an interesting challenge to to set up a company to pursue a product vision but to you know create guardrails and to like have that startup trap in mind like how could we build like how could we build an open source team that's driving forward progress in the Arrow ecosystem and some of the peripheral projects while at the same time having investors and doing enterprise product

development and so I think we you know I think partly it it helped that when we created Voltron Data that we had very clear expectations with our investors that open source was a huge dimension of how the company would be successful over time that creating open standards and protocols and building this open source composable data stack was an essential aspect of how we would be successful and so for people who are not aware like what the company is doing is

while we do enterprise support and open source partnerships for the Arrow ecosystem but the company also builds an accelerator-native like GPU accelerated execution engine which can be

incorporated into different data processing systems to essentially enable modular GPU acceleration and it's all Arrow based and and so it's something that needs to be able to plug into all of these different systems and so to develop these open source projects and standards and protocols to make that all work seamlessly is an essential aspect of how that how that will succeed so getting that buy in from investors helped us avoid the startup trap

and you know the company has a team of 20 some developers who are largely working full-time on open source and so over a period of many years so to be able to invest decades of person years in the open source ecosystem has been a has been a game changer for Arrow in the

business you mentioned this company was not for profit no this was Ursa Labs so Ursa Labs was functionally like a satellite of RStudio, Posit, so we operated independently we they handled the back office like payroll health insurance for US-based employees things like that

and so in 2020 we spun out from we spun out from from RStudio to create Ursa Computing and we raised a venture round in August 2020 and then at the at the beginning of 2021 we we joined up with the leadership from RAPIDS and BlazingSQL to sort of mash everything together and we created a new brand identity

and we raised more money for for Voltron Data a couple rounds kind of one in the seed round in 2021, a super seed, seed two I guess, we'd raised the seed for Ursa Computing, and then a Series A in January 2022

I see the reason I was asking is because from a not-for-profit like if that is the case then it might become harder to hire engineers because at some point you have to figure out compensation for people working on this and if it's not competitive enough as compared to other companies for example then you don't have the right quality of engineers working on the problem

Yeah that's true and that was I think that that was indeed a challenge in the in the Ursa Labs era that that they're there were really talented engineers that I was interested in hiring to work work full time in in the in the Arrow ecosystem and and simply because of of the economics of of Ursa Labs like the funding model on what we could afford to pay in terms of salaries and and there was no in a really no equity to offer

because it was you know not not for profit endeavor and so you know I think we had a great team but to be able to to scale up and also to hire you know to hire people who could easily go work for you know the big tech companies or Google and make a lot more money so I think that that was partly that was partly the motivation

not only to have a larger team to be able to put more resources objectively into into Arrow development but also to be able to hire individuals that have a lot of a lot of career opportunity so I guess historically would it be fair to say that one of the cons has been right compensation since you can't offer stock but in terms of pros in addition to the mission the flexibility right in terms of location

because I feel like a lot of the great people that have helped me I think when I was first getting into Kubernetes like there were like two people that really helped me out and then they were just like living in the middle of nowhere in the states where I imagine I guess at least a few years back it would have been difficult to kind of go to like a bigger tech if you want to have that lifestyle

but then I guess that's also changing now since companies are more open to remote would that be fair to say that's yeah that's definitely true I think I think COVID COVID definitely helped with changing culture as far as like hybrid you know hybrid and remote but yeah I've been you know working on a remote remote only capacity for

the last yeah six six six years or so and you know it has it has pros and cons but for you know for for open source development it's ideal because you can hire people where they are I've worked with like a lot of people and a lot of people in Europe and I think you know Europe is really friendly for friendly for open source developers because health insurance is separate from employment

and so if you are you know if you are in between maybe in between full time jobs and you want to pick up like a contract to do some open source development that's something that you can do without putting your family's health at risk whereas in the United States I think there's this there's definitely a psychological burden of losing continuity of health care coverage and that does you know lead people to

to you know to not not not make decisions like that and so having managed a global you know global workforce you know people around the world in different countries so I've gotten to see like the different like the psychological impact of that you know yeah that the health insurance question has on people so I think open source will be much better off if everyone had had you know at least a guaranteed level of you know a basic health care

right right interesting and just going back a little bit so about Voltron Data so you mentioned you were able to avoid some of the startup trap when you were being very clear and with the venture funding how did you so like being very specific we've avoided we've avoided it for now yeah maybe not forever but we're yeah we're we're really doing our best

and we wish to be good stewards of the open source projects that that that we're involved in and I think by choosing investors that understand that as well I think is is a you know part of ensuring that that will remain the case right so like being specific about like how do you deal with because one of the issues I saw to me that's like oh yeah that is very hard is like how do you balance like which features to open source versus what to keep for your enterprise version

like how did you guys go about making those decisions at Voltron Data well at Voltron Data I mean anything related to core Arrow or anything that is like projects that we want people to like interfaces protocols like we've like we've been working a lot on database better database connectivity like ADBC which is the Arrow Database Connectivity API standard

and Flight SQL which is a wire protocol for databases to offer SQL support and then we've you know gone and partnered with you know partnered with Snowflake for example to integrate that into their drivers and to make Arrow-native you know connectivity work better for better for Snowflake users and so there it's like all of the all of the pieces of technology related to that need to be need to be fully open source and so there's nothing that's there's nothing that's held back

I think the company's main product Theseus it's a GPU accelerated modular execution engine I think there's there's very clear separation between like this this this system that runs on a rack of you know rack of you know A100 or H100 GPUs it requires Kubernetes like it requires like a you know basically an enterprise data center type setup to

to use and so it's a pretty clear delineation between like software that's involved with you know building and operating building and operating Theseus and also like the types of users like you need to have certain types of hardware available to use the system at all

and so I think at least at the moment it's not open source it may be that it becomes source available or in some capacity in the future I you know it's hard for me to predict and you know it but that there's you know it's it's a specialized product for

organizations that have very large data sets and like the over 10 terabyte type type data sets where you can get 10X or 50X performance improvement or efficiency improvement by using using racks of GPUs to do that processing or maybe you've got a data center like you

built a sort of infrastructure for doing LLMs and and machine learning and you wish to also do be able to do your analytics and feature engineering directly on that on that hardware so you can shorten the whole pipeline run it on one you know sort of consistent set of hardware and get a lot better

performance that way so in a sense like the the kind of the market or the user base for that type of system is is a lot narrower than say you know say PyArrow which is you know a Python library and has you know millions of users

and downstream and you know tons of downstream projects that depend on it so yeah so I think ultimately it comes down to a question of like like who is the audience like who who are the potential users are people is something like a project you intend people to build other open source projects on top of or is it like a solution kind of an end solution in and of itself so like for example like IDE's like development environments are a lot less sensitive to

copyleft licenses like the GPL because in a sense like the development environment is itself it is in itself an end so you could build extensions to it but but people don't really need to depend on depend on the project in a in a kind of sense that you would like a project like NumPy or pandas where like this is these projects are like essential library dependencies of building something else and so

if they were if they had GPL licenses that would constrain their use and you know the same logic applies to closed-source software so it's like you know it's like what you know what what aspirations do you have for a piece of software and so I think and so you know when we you know

we made all of our early decisions in you know what to build and and you know licensing and things like that with Voltron Data ultimately like our decisions were about like how do we how do we grow the Arrow ecosystem how do we make

the composable data data composable data stack happen you know happen faster enable these modular pieces modularization you know like what open standards are missing like how do we design those open standards how do we build libraries to make it easier for people to use them

and and so that's you know so there's yeah so we've been very busy with a lot of things from Substrait to Arrow to these you know Arrow kind of new protocol projects Ibis for Python yeah our open source footprint at the company is pretty significant nice nice and this might sound a bit naive but do you like as open source becomes more critical components of any software business do you envision like innovations in terms of funding models so like

Patreon but like it really comes like I don't know really something that makes it like that's kind of me opens up to a new traps but like that's a bit different from like what we've seen before yeah I think that's something we didn't really get into is like other other funding models people have had for you know for open source but there's Open Collective there's there's Patreon there's GitHub Sponsors there's a new there's new platform called Polar

which is kind of like a GitHub Sponsors or Patreon alternative and so there and there have been a number of developers that have been able to successfully support themselves and get and get a lot of get a lot of sponsorship these ways

it can be hard to get you know big dollars to be able to pay a full time full time team of developers but but in a number of cases you have individual project maintainers that are able to support themselves as individuals if they have like a prominent prominent enough role in the project

I think what partly what they've been doing is is monetizing access to themselves either creating like exclusive content or like having a you know private Slack channel or a private discord where if you're a sponsor or patron of the project that you get exclusive access to talk with the developer about about your needs

and then the other one is like you know if you're on GitHub like anyone in the world can can open an issue and bother you at any time of any time of day or night so yeah so there's so there's been some successful there's been some successful examples of of of doing that and I think that the these models like open collective and that that crowdfunding or you know crowdfunding platforms for open source support

and we didn't have them we didn't have them a decade ago so it's it's it's it's been a big it's been a big improvement for projects that have been able to make it work interesting so a bit of a hard pivot I guess so you recently became a general partner at Composed Ventures which does early stage investing in data and for any companies

I feel like throughout your career like right you're like really good at picking up new skills like you know open source writing a book like building a company so I imagine venture capital is not like a new skill that you're like building so I'm kind of curious like what's like generally what's the approach that you use to like learning new skills that like you've developed over time and how you're applying that to venture

yeah so I so there are a couple of things there so I mean as we as we built out the Arrow project and made it more successful people started reaching out to me to get feedback on projects that they were thinking about founding or new projects that they're working on and just for getting my advice on on technical matters or asking me for favors for other things and at some point I started asking them if they'd you know let me invest in their

in their their funding rounds whether their friends and family round or their their seed round or something like that and I and I was never investing a lot of money but you know it was just a way for me to be involved and to have some some skin in the game to you know help help help you know help help the help these companies be be successful and you know as time as time went on in the ecosystem has gotten a lot bigger I think there's a couple of things so partly I wanted to be able to

I decided some investors got in touch with me and wanted wanted me to invest some of their money on on their behalf and into the you know types of investments that I've made in the past since I have an interesting network and can get in touch with companies maybe before when they're raising only a small amount of money before they go to raise like a larger like a larger round

and I also wanted to create more like content and messaging around the like the super trend of these composable data systems and the composable data stack like what we've seen with you know modular acceleration similar to what similar to what we're doing at

Voltron Data but also we're seeing modular acceleration projects out of out of Meta and out of out of Apple and and different things there and so basically what's happened is that people are building new versions of old products but with these really high quality off the shelf open source components and that's in a sense like what we wanted to happen and so

the fun gives me a way to you know invest in those companies and but also to create more awareness of this this trend that is taking place with all these different companies which are building on building on arrow we're building on some arrow off shoot projects like data fusion which is a rust based embedable query engine modular query engine building on duck DB or things like that because we've worked very hard to like you know to to enable all these pieces to exist and

to make them fit together nicely and so it seems like a healthy sort of ecosystem shift that's taking place and so so this gives me a way to be involved with with founders and to help companies you know get off the ground but also for for people to be like aware of like you know different people working on different approaches to solving you know old problems with these new kind of open source tools.

I see. Do you like it so far?

I mean, my goal is for it to not become a full-time job. It's something that I'm doing part time, and my full-time engagement is with Posit. I'm still an advisor at Voltron Data, and I advise a couple of other companies, LanceDB and Union.ai. So I have one leg in the startup venture world, and one leg as a software architect at Posit.

But yeah, I've enjoyed it so far. The fund just started in January, so I've made the first couple of investments. I don't currently have plans to become a full-time investor or to raise a large fund, but to have a small fund that enables me to write medium-sized angel checks or super angel type checks and be helpful to founders. I think it gives me a meaningful way to be involved.

Maybe the investments will make money, but that's not strictly why I'm doing it. I'm clearly putting capital at risk, so I hope that the investments will make as much or more money than buying real estate or investing in the stock market, but my primary goal is to accelerate innovation in the space and help people succeed.

Okay, so I did a little incubator a few years back and had a terrible idea, so I feel like I'm qualified to ask this question. Let's not pick on the worst idea you've heard, but what's the second worst idea that someone's pitched you?

The worst idea that somebody has pitched to me?

The second worst, the second worst.

Well, I have a hard time remembering, but it probably wouldn't be appropriate for me to share. I don't want to hurt anybody's feelings.

Sorry, sorry. I was trying to... no, my bad.

That's fine. I'm actually going to ask: you mentioned this new trend, which is composable data stacks. I've worked more on the compute infra side and less on the data infra side, so I just know a little bit about the data space.

And historically I've seen a lot of these projects being open source, all the way from storage, like Hadoop for example, to processing layers like Spark, and then streaming layers like Flink. And then you look at data formats: Protobuf and Thrift and whatnot, and Apache Arrow is another example. When you see data scientists wrangling data, they use pandas and NumPy.

So in my mind data stacks have been composable, but I'm not sure what you mean by the new trend, so it'd be really great if you could describe what you were referring to.

Yeah, so the general idea is building a system while making use of as many open standards or protocols as possible for the different layers of the stack.

So for example, at your storage layer there are projects like Parquet and Iceberg. Iceberg is an open standard, an open source data lake format that's interoperable across many different execution engines. Parquet is an open standard file format for analytic data storage. Then there are execution engines, and the goal ultimately is to be able to hot swap them, or to be able to hand off work: to choose which execution engine to use based on what will deliver the best performance or the best efficiency for a certain workload. So that would be at the query optimization level or the user interface level.

If your user interface and your query optimizer are loosely coupled to the execution engine and to the storage, this enables you to make a different decision about which engine to use. You can also incrementally make improvements to the stack or incorporate new components in a way that's less disruptive.
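To make the loose coupling described here concrete, the following is a minimal Python sketch and not how any real engine works. Every name in it (`Query`, `ExecutionEngine`, `LoopEngine`, `ComprehensionEngine`, `choose_engine`, the `threshold` parameter) is hypothetical and invented for illustration. Plain Python lists of dicts stand in for data that a real system would hold in open formats such as Parquet or Arrow.

```python
from dataclasses import dataclass
from typing import Callable, Protocol, Sequence

Row = dict  # one record; a stand-in for data stored in an open format


@dataclass(frozen=True)
class Query:
    """Engine-neutral description of a filter-and-project query (hypothetical)."""
    column: str
    predicate: Callable[[object], bool]


class ExecutionEngine(Protocol):
    """The only contract the front end relies on (hypothetical interface)."""
    name: str

    def execute(self, query: Query, rows: Sequence[Row]) -> list: ...


class LoopEngine:
    """Toy engine: evaluates the query with an explicit loop."""
    name = "loop"

    def execute(self, query: Query, rows: Sequence[Row]) -> list:
        out = []
        for row in rows:
            value = row[query.column]
            if query.predicate(value):
                out.append(value)
        return out


class ComprehensionEngine:
    """Toy engine: same semantics, different implementation."""
    name = "comprehension"

    def execute(self, query: Query, rows: Sequence[Row]) -> list:
        return [r[query.column] for r in rows if query.predicate(r[query.column])]


def choose_engine(rows: Sequence[Row], small: ExecutionEngine,
                  large: ExecutionEngine, threshold: int = 1000) -> ExecutionEngine:
    """Pick an engine per workload; the size threshold is an arbitrary assumption."""
    return small if len(rows) < threshold else large


def run(query: Query, rows: Sequence[Row], engine: ExecutionEngine) -> list:
    """Front end: knows only the Query type and the engine interface."""
    return engine.execute(query, rows)


if __name__ == "__main__":
    rows = [{"x": i} for i in range(10)]
    query = Query(column="x", predicate=lambda v: v % 2 == 0)
    engine = choose_engine(rows, LoopEngine(), ComprehensionEngine())
    print(engine.name, run(query, rows, engine))
```

Because `run` depends only on the shared `Query` description and the `execute` method, either engine can be swapped in, or a new one added, without touching the front end.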

It's challenging right now because I think some of these things are still in their early days, but they're rapidly developing. So it's our hope that in the coming years building systems like this will be a bit less bleeding edge and more obvious, and will be considered the best choice for how to build new data systems.

Makes sense. It sounds like even open source projects have this thing of buy-in, in a way, or lock-in in a way: yes, you can change it, but changing is super expensive. What it sounds like is these modular systems can make it easier for you to swap one part versus the other.

Right, that's right.

Makes sense. Well, this has been an awesome chat. Thank you so much for taking the time today. We learned a lot through this conversation, and I'm sure our listeners did too.

Thank you so much for joining the show.

Yeah, thanks for having me. I enjoyed it.

Thanks so much.

Hey, thank you so much for listening to the show. You can subscribe wherever you get your podcasts and learn more about us at softwaremisadventures.com. You can also write to us at hello at softwaremisadventures.com. We would love to hear from you. Until next time, take care.

This transcript was generated by Metacast using AI and may contain inaccuracies.