Monorepos. You've heard the talks, you've read the blog posts, maybe you've seen a few glimpses into how Google or Meta organized their massive code bases, but it's often in the abstract and behind closed doors. What if you could crack open a real production monorepo, one with over a million lines of Python code and over a hundred sub-packages, and actually see what's being built step-by-step using modern tools and standards?
Well, that's exactly what Apache Airflow gives us. On this episode, I sit down with Jarek Potiuk and Amogh Desai, two of Airflow's top contributors, to go inside one of the largest open-source Python monorepos in the world and learn how they manage it with uv, pyproject.toml, and the latest packaging standards, so you can apply the same patterns to your own projects. This is Talk Python To Me, episode 540, recorded February 10th, 2026.
Welcome to Talk Python To Me, the number one Python podcast for developers and data scientists. This is your host, Michael Kennedy. I'm a PSF fellow who's been coding for over 25 years. Let's connect on social media. You'll find me and Talk Python on Mastodon, Bluesky, and X. The social links are all in your show notes. You can find over 10 years of past episodes at talkpython.fm, and if you want to be part of the show, you can join our recording live streams.
That's right, we live stream the raw, uncut version of each episode on YouTube. Just visit talkpython.fm/youtube to see the schedule of upcoming events. Be sure to subscribe there and press the bell so you'll get notified anytime we're recording. This episode is brought to you by our Agentic AI Programming for Python course. Learn to work with AI that actually understands your code base and build real features.
Visit talkpython.fm/Agentic-AI. Hello, hello, Jarek, Amogh. Welcome to Talk Python To Me. Awesome to have you here, Amogh, and Jarek, to have you back. Very nice to be at Talk Python To Me again. It's one of my favorite podcasts, I listen to it all the time. Thank you, thank you. It's my first, but yeah, thanks for having me, Mike. Happy to have you here. You and a team of people, given the scale of this project, have built an amazing, amazing product with Apache Airflow.
It's going to be really fun to dive into it, and specifically, we're going to focus on not building workflows exactly, although I'm sure we'll talk about that somewhat. The real goal, the thing that we're going to focus on, is how do you manage such a big project with so many different internal packages that all depend upon each other and so on, and monorepos, and that. I've touched on monorepos before, but two things. I think this makes a really interesting discussion for listeners out there.
One, this is going to be very concrete with exact steps, and it's even open source. You can go check it out and play with it. And two, the tooling and the standards have changed significantly since I talked about this three or four years ago, making much of what we're going to talk about possible, right? Absolutely. Yeah. Now, before we dive into that, of course, let's do quick introductions. Jarek, it's been a while since you've been on the show. Who are you? Tell people who you are.
I'm an Apache Airflow maintainer, one of the PMC members as well, and also one of the Apache Software Foundation members. I've got this nice, thin new logo of Apache Software Foundation that we got at FOSDEM. I'm also an Apache Airflow security committee member, which is an important aspect for what we are discussing today because of supply chain and dependencies and the many potential security issues these dependencies bring.
One of the few lucky people to contribute to open source full-time and get paid for it, which is amazing. Maybe another podcast one day about that, because I think that's also an interesting one. Yeah, I have something like that, a topic somewhat like that brewing. So, yeah, potentially to have you back for that. Hey, I'm Amogh Desai. Again, similar to Jarek, I'm a PMC member and a committer at Apache Airflow.
And I'm also part of, I'm one of the top 10 contributors to the project, top 10 all-time contributors to the project, Jarek being number one. So I work at Astronomer as a senior software engineer, where I get to live in both worlds. One is contributing to Airflow's code development and also supporting the companies that are trying to run Airflow at scale. Awesome. What is Astronomer? Tell people about that.
We're a company which is one of the leading contributors to Apache Airflow and also a leading consumer of it. We provide a managed distribution of Apache Airflow inside Astro. And yeah, I think we have a data platform as well to try and make your lives easier when you use Airflow at scale. And let me add to it, two comments.
So Airflow has a number of stakeholders, and commercial stakeholders who are hosting Airflow as a service as well. And, you know, with so many using Airflow, we have contributions from all over the place. Astronomer has by far the biggest number of contributions and is a fantastic open source stakeholder. We are very much focused on making Apache Airflow a truly vendor-neutral Apache project. Like I'm always amazed how well this works. And the second thing, the number one, I'm cheating a bit.
Like, you know, I do a lot of small PRs. This is how you get the number one. I guess it depends how you measure it, huh? You know, you could always just do one ginormous AI PR that's like a hundred thousand lines of code in your PR and people would love you for it. And you'd be a mega contributor. Oh, yeah. Well, no. He does both. The funny part is Jarek does both. His velocity amazes me or, I don't know, shocks me sometimes. He does massive PRs and also like a lot of tiny ones.
And by the time I'm looking, there are like three more out of it. I don't know how he does it. We're going to get to a bit of how much traffic there is on Airflow in terms of like open source activity. It's some, it's a little bit. Before we move on though, Jarek, what is the Apache Software Foundation? What is this Apache thing that you're talking about? And why is Airflow part of it? Very quickly. It's a foundation. One of the oldest open source foundations in the world.
25, 26, seven years now. I think the main thing about Apache Software Foundation is that it's individual driven. So every member is an individual, not a corporate, as opposed to like Linux Software Foundation, where members are corporates. And people make decisions in both the foundation and the projects, or PMCs, so-called project management committees. And Airflow is one of the PMCs. So one of the project management committees, which has PMC members.
We are both PMC members and we have like 50 other individuals or 60. I can't remember, like the number changes. Like we are inviting new ones all the time. We make decisions as humans, as individuals, not the corporates who are employing us, for example, because it's a meritocracy-based system where people have merit and the merit doesn't expire and the merit belongs to individuals, not to the corporates.
That's one of the big things. Like pretty much all the open source software out there has some Apache Foundation or Apache component in it. It started with Apache Server 20, 30 years ago almost. But now we have more than 200 PMCs. We just passed the 10,000 committers mark two months ago, I think. So like lots of individuals, lots of people contributing to the foundation. And the main thing about the foundation is community over code. So we value building communities more than actually producing code.
We believe producing code is just byproduct of great communities working together. And ASF is a charity, is a public good charity in the US registered in Delaware. So we actually cannot be sold. We cannot change our license. Nothing like that can happen because of the status of foundation. And a really positive force for open source, right? Oh, absolutely. Absolutely. When I first got into like learning how ASF works, I said like that it has no chance to work. Like there is no way it works.
It's too idealistic. There's no way. Absolutely. And nobody in the foundation who makes decisions gets any money. So like everyone is a volunteer. All the PMC members, all the committers, all the board members, the president, all the VPs, those are all volunteer driven roles. And those are the people who make decisions. We just pay a few people in infrastructure and security. That's basically it. Let's start by just talking about high level abstract. What is a monorepo?
I think it's so easy to make that sound like the same thing as a monolith. You're like, oh yeah, monorepo, monolith, same thing, right? And yet you're shaking your head. The first time I met personally a monorepo, maybe I can continue with that, but that was like at Google. I worked at Google years ago and I was surprised coming to Google that all the code there is in a single monorepo. Even though like we have like hundreds of products and all the stuff you see.
It's got to be a lot of code, right? Like a giant, giant repo. Like now they have like maybe four. I don't know. Like I've heard some stories. I don't work there for a long time now. But that for me, that was a sign that like you don't really have to split and dice and slice your repositories into many, many small ones.
Even if you have like non-monolithical product, it all can be kept in a single source, single repository, separate source trees maybe, separate like we'll talk about how we do it in Airflow. But it's a way how you can bind it together and have it tested together and have it developed together. Even though each piece is pretty much separate and you can work on them separately. That's the monorepo.
As opposed to multirepo, which is like when you have multiple repositories consisting of whatever comes up as a product. Yeah. Everything that Jarek said plus just a small addition, which is each of the component or the tiny bit of a monorepo can have its own build artifacts, its dependencies. It can also have its own release cycle or a release vehicle. That's the only addition, but everything is put together as a big puzzle just to keep the puzzle together.
You know, not every monorepo is Python, but in Python terms, it could have its own pyproject.toml, potentially its own virtual environment. The nomenclature ironies of this is often the monorepo, I think, makes more sense when you are working with lots of small parts, right? Where the monolith, maybe it has a couple of things, but it doesn't depend real deeply. The more interconnections you have and the harder it is to manage those versions, the more something like this makes sense, right?
People really make a connection between isolated work on part of the system into having to have separate repository for that, which is completely not the case. Like you can actually have an isolated sub part of the repository, even if it's Git. Git doesn't have like, you have some modules and sub repos and all that stuff. But even like in a single Git repository, you can easily have like start working and focusing on a small part of the whole monorepo and only care about that.
That's what the monorepo is. I'm going to go ahead and put it out there. I'm not a big fan of microservice architectures. I kind of find it's trading code complexity for DevOps and deployment complexity. And I think we have better tools to manage code complexity than DevOps complexity. But something like this does help you manage those kinds of deployments better as well, right? I use the term mini-services, not microservices. Microservices is just too much.
But then you can have a lot of mini-services, a number of mini-services, but not micro. Like micro was just too much of a mainstream. I can get on board with that. Amogh, what do you think? I like that as well, mini-services. Maybe you should coin that too. It's the microservices that are too small. It feels to me like the equivalent of when you're trying to write unit tests and you're like, oh, what if I get a customer and I set their first name? And then I check that their first name is set.
Like, what are you doing? You don't need to check that assignment works. This is too, you're just too much in the weeds. You know what I mean? This is what AI agents do now all the time. Like, no. Yeah, think of the code coverage. Just think of the code coverage. Come on. You've got some goals to hit. You said 80% code coverage. It's on top of it. Yeah. That sets the stage. Let's talk a little bit about specifically how Apache Airflow has come to need this, basically. Right?
Like, you shared with me the pulse, the GitHub pulse for Apache Airflow. And it's kind of worth looking at just how much open source interest and traffic there is. Who wants to kind of summarize this weekly pulse here? This is not the best week in terms of the number of comments. We have had even more red, but in the week of... Wow. Just one of those weeks. Yeah. One of the usual weeks. Between Feb 3 and Feb 10, we have had about 310 active pull requests.
So, you can imagine that's about 40 plus pull requests a day. A lot of them are being assisted by the AI revolution going on, but that's a lot of pull requests. And we have merged about 200 of them. About 100 are open. And similarly with issues, right? 35 new issues. Five issues per day. That's a lot of traffic. So, you can imagine the amount of review pressure each of the maintainers has here. There's 300 pull requests spread across, I don't know, 120, 130, maybe 140 distributions.
And each of the distributions having like a swim lane owner who is actively trying to take a look at these pull requests. So, it's just another week to be very honest. It's more than 25 PRs a day, including weekends. How many of these people are high value? How many of these PRs are high value? I guess I'm trying to get the sense of like, how much does this get accepted? Are these just people throwing stuff out there that doesn't make sense for the direction of airflow?
Well, those merged all make sense because they are reviewed and merged by Airflow maintainers. And we are very serious about that. So, like we don't merge anything that doesn't pass our bar, which is extremely high. Like we have 170 prek hooks which are checking if the code is doing what it was supposed to be doing and if it's architected properly.
And on top of that, we have individuals, people like Amogh, myself and maybe 50 other PMC members and committers who are reviewing it and making their comments and know the system enough to direct people. So, they do make sense. We do have recently, and that was a recurring theme at the FOSDEM conference last week when I was there, AI generated contributions. And many of the AI generated contributions are not the best quality. It's not like AI is bad quality.
Many of those are easier to produce and they might have bad quality. So, we are now learning how to filter them out and how to handle them quickly. But those are the actual high value PRs that we merged. In terms of numbers, if I may, it would be maybe a third of the open pull requests, as a general trend. That's pretty good, honestly. Yep. We have some guidelines published very recently. And due to that, we have seen a dip in such low-quality PRs.
We published some guidelines in our contribution guides about what action will be taken if, you know, bad quality PRs are raised, or PRs are raised where the author does not know the context, but the AI does. I don't want to go down this rat hole. People hear this enough lately, but I just, it's been in the news lately. Open source projects have been kind of getting a barrage of AI submissions. And I think that comes in a couple of flavors.
One, people who just want to get their name listed as a contributor, maybe it helps them with their job or whatever. So there's like a small incentive there, but it's been really bad for bug bounties. Like curl closed its bug bounty program because people were trying to make the 50 or $250 by finding some issue with AI. Is that a problem for you all just taking the pulse of a big project like that? It is.
I actually had a talk about that at the Global Vulnerability Intelligence Platform Summit just before FOSDEM. So that was exactly like, I even quoted Daniel Stenberg and I met him there at FOSDEM, which like, that was really cool. There are different motivations of people who are submitting those AI issues, and we should fight them in different ways, with different approaches that respond to those motivations. We have some ideas.
We have an open discussion in GitHub maintainers list right now. And GitHub is trying to address it by like just discussing what they can do right now. And that's the highest priority for them. We have a discussion with OSSF for security kind of guidelines or policies for open source maintainers, how to deal with those issues.
And I'm sure we will work out some ways and toolings and most of all processes and like being assertive is one thing, like just saying no when the report doesn't meet all the bars immediately. And, you know, directing people to the description is good enough of a, you know, barrier for, you know, getting kind of completely broken PRs because we have to just make it more expensive for the reporters than for the maintainers to diagnose the issues or decide if the issues are bad or good.
And I'm not necessarily saying that there's something inherently bad because AI wrote some of the code rather than a person. AI can write really good code, better than a lot of people I've seen. But it has this sort of shotgun effect often of just like, I'm going to change all these files and it's not as focused and clear. A lot of times it just doesn't get the Zen of it. You know, Amogh, what do you think?
It'll generate code, which it thinks is good, but we don't really know the ripple effect and we want to avoid such things. Such a long living app with lots of complexity. Right. And we all are using the AI for generating the code, to be honest, like so like most of my code. You should. Yeah, it's incredible. It's it's I pulled up this graphic here and I'll link to it in the show notes.
Just to give people a sense, I got this little utility that I released this week called Tallymon, which like analyzes code and gives you sort of more of a breakdown than just like this many lines or whatever. So I want to just highlight, maybe you all can riff on this a little bit to give a sense. So 1.2 million lines of Python, 918,000 excluding comments, maybe a little over counting the way this thing works, but still, 200,000 lines of reStructuredText.
The one that really stood out to me, 81,000 lines of YAML and 16,000 lines of TOML. You guys, that's impressive. And you know what? Hat tip to just a sprinkle, just a hint of Java at 42 lines of Java. But, you know, almost a million, just over a million lines of code without comments. That's a big project. What do you think? What happened when you joined? I don't know. I think it was much less. You did contribute a lot.
You can imagine so because of the number of packages we have, and the monorepo discussion from earlier. We have a lot of packages and the YAML might surprise you at first. But if you actually go and see why the YAML, it's mostly for our providers. So integration with other systems is something we call providers. And the spec of the providers is written in YAML. And TOML, sure, we'll come to it very, very, very soon.
That's kind of why I pulled this up, actually. The TOML aspect is quite interesting, so let's leave you with that number as we move on. 16,000 lines of TOML. That's a lot of pyproject.toml going on right there, folks. Oh, yes. And lots of it is generated, actually. So like, because we actually generate quite a lot of the YAML and TOML that we have and keep it in the repo. So we don't want to regenerate every time. So like, we don't write YAML by hand.
Maybe we can start by introducing this by just giving a shout out to this series that you wrote over here on Medium. Jarek, modern Python repo for Apache Airflow, parts one through four. Yes, I initially started discussing this blog post idea with a few people. Like, you know, like people are busy and I couldn't get people like to write it. So I decided to write it myself. Well, with a lot of AI help, of course. It's not that everything is written by hand.
And when I wrote it, I realized it's like too big and I had to split it into four. But the idea was like to document what we've done because I think that a lot of people are struggling with like monorepo versus multirepo or like how they should do their repository when their project grows. And there were lots of discussions in the past, including here, one of the, you know, one of the podcasts of yours was monorepo versus multirepo.
And I can't remember who that was, but there was discussion about like going back and forth and like finding that people sometimes go back and then then go forth and like in different directions because there are different problems or approaches. So I just wanted to document the reasoning why we are doing it, like why it's possible now because of the packaging ecosystem maturing for Python and uv and other tools coming into the space.
And then the last part was like really the kind of a little bit innovative approach that we do where the tooling is still not catching up with what we need and what we did. So those are the kind of history why we are doing it. The, you know, the packaging, the automated verification with prek. So that was the third part. And the fourth part was about like the shared libraries, the innovative concept that we added for, for, for.
I'll link to the series as well as to a talk that you gave at FOSDEM that just got published, right? Yes. Yes. And they have an amazing system of recording and publishing stuff. Like, for a volunteer driven conference, a thousand speakers. Oh, that's amazing. That works. Like probably some automation going on there.
Let's talk a little bit about, I guess, the problems that you ran into because initially there were some challenges with the standards and tooling not being there. And you actually, one of the takeaways, if people read the series or watch the talk, is you actually had to work with some of the tool providers to make this possible. So not only is it like, well, the tools have changed, so we could do this.
It's you all have changed the tools a little bit through, you know, working closely, like, Hey, we've got this 1 million line project with a hundred sub modules or more. Help. Like, adjust your tools to support this. Help me make this work. Right. What were some of the problems?
Let me start with this cooperation and maybe, you know, Amogh can also explain like what was before and after, because like he experienced that firsthand as a user of this kind of repository structure. But for me, the idea was like, I was working on it for years. Like when we went to Airflow 2 five years ago, or four years ago, I can't remember. That's a long time.
And we didn't have all the tooling and we had to do pretty much everything that we do now with the monorepo and uv by hand, by bash scripts at that time. By that time, crazy. So like, if you looked at the code three years ago, you would see more than 10,000 lines of bash code, which I wrote. But we since removed it. We since removed it. That is not joyful. That doesn't spark joy. That's why we removed it with an Outreachy internship actually.
And shout out to Edith and Bowrna, who were our Outreachy mentees and who helped us to convert it to Python, which was really helpful. That's how it started. There was no tooling, and we had the need because we grew, we wanted to have more providers, more integrations, and it already was quite difficult to manage if they were all part of a single distribution. So we had to split into many distributions, 60, I think at the beginning. Now we have more than a hundred.
Now, when we did that, we had to do it all manually and like working with that was like really cumbersome. Maybe, you know, like I can switch to Amogh. So he can describe the past experience and the new experience because like he experienced the change himself. Yeah. The past experience was scary, to say the least.
Whenever I switched branches or had to rebase for whatever reason, I had a nightmare, a very bad time trying to, you know, package things together and try to run something. And I think Jarek found me often, you know, ranting on the Slack channels that, Hey, this doesn't work. Hey, that doesn't work. What do we do? Now it's very easy. It's effortless, almost effortless compared to what we had years, maybe like five years ago, four years ago. Yeah. Amazing. How does GitHub deal?
I was the only one who actually managed the whole thing for years. And I was like overwhelmed as well when people had problems, of course. So then the change that we've done was not only with the tooling. And as you mentioned, we were actually cooperating with Charlie from Astral, Charlie Marsh, and with Joe from prek because we had this need. We had it implemented ourselves and then they could look at how we've done that and they could implement it properly in their tooling.
And we've been like exchanging the, you know, like Charlie was even interviewing me at some point of time, how we, how, what, what are our needs? So I have for a long time, I have this, this motto that the best way to foresee future is to shape it. And like, so we did shape the future by, you know, talking to those tool providers so that they can, or builders so that they could build it for us and work with us. And we helped them to test them and everything like that.
But also it was like listening to Amogh and other contributors, like all the problems they had. And then when I solved it, I wouldn't only solve it with the new tooling, but we also engaged more people from the team, like Amogh and a few other active contributors. And they were actually part of the whole process of conversion. And they are now part of the team. And now we can have this podcast while things are being broken in Airflow right now.
And somebody is probably fixing it right as we speak. So like, not me anymore. So that's, those old, old things are really great. This portion of Talk Python To Me is brought to you by us. I want to tell you about a course I put together that I'm really proud of. Agentic AI programming for Python developers. I know a lot of you have tried AI coding tools and come away thinking, well, this is more hassle than it's worth. And honestly, all the vibe coding hype isn't helping.
It's a smoke screen that hides what these tools can actually do. This course is about agentic engineering. Applying real software engineering practices with AI that understands your entire code base, runs your tests, and builds complete features under your direction. I've used these techniques to ship real production code across Talk Python, Python bytes, and completely new projects. I migrated an entire CSS framework on a production site with thousands of lines of HTML in a few hours.
I shipped a new search feature with caching and async in under an hour. I built a complete CLI tool for Talk Python from scratch, tested, documented, and published to PyPI in an afternoon. Real projects, real production code, both greenfield and legacy. No toy demos, no fluff. I'll show you the guardrails, the planning techniques, and the workflows that turn AI into a genuine engineering partner. Check it out at talkpython.fm/agentic-engineering.
That's talkpython.fm/agentic-engineering. The link is in your podcast player's show notes. How does GitHub deal with so many files and such a big project? Is it fine or is it a challenge? Except yesterday, where half of the time GitHub was not at the end. Except yesterday. Yeah, for people who don't know, yesterday morning, at least morning US time, GitHub was having a moment. Like, it was, I couldn't clone stuff. I pulled up the random page on GitHub and got the 503 Unicorn.
It was not good, right? Besides that, not excluding that time. The Unicorn is actually a little bit like looking kind of angry at you. That's one of the observations I had from yesterday. I saw it so many times that it's like, it doesn't look nice. But maybe GitHub. I agree. That's not a great error page. Like, some error pages are amazing where it's like, you know, the coyote fell off of a cliff. Woo! You know, like, that one just looks like it's angry back at you. Besides that, it's perfect.
Like, it works like seamlessly, no problems whatsoever with the size, with the numbers. Like, we are very, very happy in general. And of course, like, things like that happen. There is nothing wrong. Like, there is something wrong, but like, it's not like that, that it happens all the time. Not really like GitHub. It's super rare. GitHub is an incredible service. I mean, I know there's been some grief about the GitHub actions, but I put, that's a different, different conversation. Right?
So let's talk next about how the packaging standards have changed and how basically some of those things have made it possible. And so in your talk, you pulled up a bunch of different PEPs, nine of them or something like that, that were about packaging, recent packaging standards and different things like that, that have made basically the structure that you're working with and the tools that do it possible.
Do you want to maybe highlight, either of you, some of these things that stand out as, this one is really important. The one which is maybe not super related to monorepo, but it actually helped us a lot, is PEP 723, the last but one, inline script metadata, which is like one of the biggest successes and the biggest kind of usages I see from a PEP being implemented.
It caught on very, very quickly. It allows us to, you know, embed inline script metadata into the Python scripts, which is like something that we've been dreaming of for years, especially for this kind of tooling, the CI environment, et cetera, et cetera. This is really, really helpful. So that's the one that I would like to highlight.
But I, you know, I read all of them like many times, all the peps and they are difficult things to read, to read and understand, but they were like, we actually did all that we could to, you know, be like fully compliant with the, not only with the specification of those peps, but also with the kind of spirit of the specification, because sometimes things are not very precisely described and there are some interpretations and stuff.
So we just, we just made sure, and this is our, our goal as well. Like we just make sure that all the PEP standards that are being published are actually very meticulously followed. And we just try to adapt to any changes that are coming in the environment. So we know how difficult it is if people are sticking to the old ways and like that's, that makes difficult for Python maintainers. Mark, any other thoughts?
This one is a particularly very important one for us also because it simplifies our pre-commit configurations where earlier we had to, you know, specify the dependencies as required. So like whatever the particular version was, but now it's all in the script. It's not, and the pre-commit remains as clean as it could just with the hook name and, you know, the regex for the file filter and minimal configurations for it to work well. And I think your dependency group is also the other pep.
I don't recall the name, but I recall the number. I think it's six. Oh, I can't remember all the numbers, but one of those. That would be 735, folks, 735. That's also particularly nice for us. We can define the dependency groups in our pyproject.toml files, and it's really nice how it works with uv. We're very happy with dependency groups as well as the inline scripts. Right. The inline scripts are cool.
You know, especially with uv these days, it really makes running some kind of Python code so much easier. It's almost as if everything is standard library. I can give somebody a file and say how to run it. No, no, no, don't. I know it looks like you say python, but don't say that. You say uv run this, and that's it. They didn't even have to have Python. It might need 10 dependencies and so on, but it doesn't matter. Right. Yeah. And it's a standard.
It also means, you know, other tools are doing the same, like hatch run. That's the same. There is even support for inline script metadata just released in the latest pip 26. So it's all good because of the standards and not because a single particular tool does it in an opinionated way. So this is really, really cool. And there is one big benefit of those kinds of PEPs, and particularly inline script metadata: we have less YAML.
Yeah. You already have a lot of YAML, but less is better. We have a lot still. We can't get away from that. It's better than it was. Yeah. And so the dependency groups are, you know, for dev or for tests or something like that, right? So you can say uv sync or uv pip install, and you can say thing bracket dev or something like that, right?
The nice thing about uv sync is that it syncs the dev dependencies automatically without you even specifying that, which is the best thing for development because you actually always want to have development tools with you. That's a good point. Yeah. That's really cool. Those were the changes to Python itself through the PEPs.
But there's also tools, and you've already mentioned some of them, both of you, but tools that make this possible, and I think uv has to be number one on this list, right? uv has really done some powerful stuff here. Right. Again, Amag can say more. I introduced it, but Amag was the one who made the switch to uv at some point. Yep. uv has been a game changer. I think we were using Poetry before this, or Hatch. I don't know. No, not even that. Just pip. Just pip.
Just pip. Just pip. It's so good, I don't even remember. The, you know, game-changing aspect that uv brought in was this notion of workspaces. It's something very simple. You can compare it to, you know, a co-working space or something similar, where it's a unified environment in which multiple interconnected pieces coexist and are very easy to manage. And that's something that eventually led us to splitting the whole repository into separate distributions.
And that's the reason you see so many TOML files. Everything has a pyproject.toml. Everything defines the dependency groups it needs, and development of a particular package is restricted only to its dependencies. So you develop it, you run uv sync, you can run your pytest using uv, and everything that is supposed to run with it is running with it. And any bad or, you know, cross imports are caught really easily. So I think the workspace feature at least was the most important one for me.
And obviously the speed that it brings with it. And that's impressive. It is. And I think this workspace concept, it's new to me. I'll say it's new to me. I don't know how new it is to other folks. So you've got this giant monorepo, and how many conceptually different packages or projects are in there right now? 120 plus.
It changes by the day, because Amag is doing a lot to increase the number very quickly, because we are just now in the middle of finishing some isolation restructuring. And that's why Amag is here also: he leads the introduction of new packages, or new distributions, that we have, like the shared libraries that we will talk about later. So we have a lot of those. Yes. I think this is super important to dive into, and how uv makes this possible.
And I think you said also Hatch. You talked with Ofek, who runs Hatch, about this as well, right? Yes. Yes. Hatch is also supporting workspaces, which are modeled mainly after what uv has done. We haven't tried it yet, but I've heard it's very, very similar, or that you can even use it as a one-to-one replacement in some cases, or maybe even in all. But generally, I would love this to eventually become some kind of standard so that multiple tools support it.
But yes, there are a few other tools that we were considering before, but uv is by far the, yeah, well, we work together. We shaped it together with the uv team. So it definitely works well for us. Yeah. Amazing. So let me describe this a little bit and then you all can actually introduce it. So the idea is we've got this monorepo with a bunch of different folders for the sections, right? Like airflow-ctl and airflow-core and so on.
And you'd like to be able to kind of just jump into one section and treat it as a top-level project, right? It's got a pyproject.toml. It's got a source folder, tests, and so on. But the challenge is you can't just have a bunch of disconnected pieces. Maybe airflow-core depends on five other parts that also have their own pyproject.toml and different things.
If you jump into airflow-core, you've got to set up the environment just right to be working on those other parts, right? It sounds pretty tricky. So how does that work? Who wants to make sense of this for us? It works perfectly. It's super, super simple, actually. You know, the whole thing about uv is its simplicity. Not of the implementation, which is actually quite tricky, but the way you use it is very simple.
Just go to the directory and run uv sync. That's basically it. This is the directory you want to work on. And it does exactly what you would expect it to do, which means that it syncs. It actually updates, or basically recreates, the virtual environment that you're using with all the dependencies that this particular distribution needs, and anything that it needs as a transitive dependency as well.
So if it refers to another project inside the workspace, it will also use it from there, not installed from PyPI. So you can immediately start working on this, because after uv sync everything is exactly as you expect for this particular subset of the repository that you are in. And that's basically it. This is all. There is nothing more, basically. That's it. It works. And when you run uv run pytest, it will do exactly what you want.
So in this folder, you can also just uv run pytest, and it will do exactly what you want, because even uv run will automatically sync the virtualenv, very, very quickly, to the one that your project needs. And then it will just run pytest in this virtual environment and it will run all the tests in your project. And that's basically it. So conceptually, for the users, you don't have to do much, just uv sync. And that's it.
I think one of the big challenges here is how do different parts of the project know about each other, right? Yeah. You said that it symlinks the different elements in. The basic workspace implementation is just a workspace definition. So you have to have the definition of the workspace in the top-level pyproject.toml. There you have all of them listed. You have links to them.
It describes where they are, and uv will read the pyproject.toml from the top level and will know what they are. It will know where to look for particular distributions. So that's the simple discovery, and it's how we know that we are using it from the sources and not from PyPI. But then the shared libraries are something that we added on top of it, and the symlinks are on top of that.
And this is an extra innovative thing that we are doing for something else that we need, but, you know, we can talk about that now, or Amag can talk about it. This is really cool. So one of the things that happens here is each of these different slices or subsections of the monorepo has a pyproject.toml, and that pyproject.toml defines its true dependencies and its dev dependencies and so on.
So when you go and jump into a section, uv will basically realign the virtual environment with whatever dependencies are supposed to be there from those things. Right. So that means installing stuff, obviously, but actually what surprised me a little bit, not a lot, but like, oh yeah, I guess it does do that.
That's cool. It actually uninstalls stuff that's not explicitly put there. I can imagine before that you could be like, well, this one part way down here depends on this weird library. And somehow I used to be over there, then I went back to this other piece, and then I came back and forgot where that even came from. Like, why is that in my virtual environment? And how do I specify that? Juggling that was probably a big problem, right?
This loading and unloading of dependencies based on what part of the monorepo you're in. And I think that actually makes it much easier to deal with this type of code structure. Let me add one more thing to that, because it's not only the dependencies that you might have from somewhere else, but also the cross dependencies between different distributions inside.
So for example, if airflow-ctl does not use airflow-core, then if you go there and run uv sync, you will not be able to import and use any of the source code that is in airflow-core, because it's not a dependency of airflow-ctl. So uv sync will not only uninstall the dependencies that you have, but also uninstall the source code that you have from other parts of the repo, which is a fantastic thing for us. And that was exactly what was missing before: isolation between those.
From your source, you can only refer to the source code of those distributions that you depend on, and nothing else from the monorepo. So this means that you can slice and dice your repository as you want. Depending on which directory you are in when you run uv sync, you will have a subset, the actually useful and used subset, of your repository.
And it can be completely different if you go to another directory. Some of that can be overlapping, some of that can be completely different. It depends on which dependencies are defined. And this all magically happens just by defining the dependency in pyproject.toml, and uv sync will handle it for you in the workspace. That's exactly the reason why it's so useful for developers.
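The isolation effect described above amounts to a transitive closure over declared dependencies. The sketch below is only a model of that idea, not how uv is implemented; `airflow-core` and `airflow-ctl` are named in the conversation, while `example-provider`, `example-plugin`, and the dependency edges between them are hypothetical.

```python
# Hypothetical declared dependencies between workspace members.
DECLARED = {
    "airflow-core": [],
    "airflow-ctl": [],
    "example-provider": ["airflow-core"],
    "example-plugin": ["example-provider"],
}


def visible_members(start: str, declared: dict[str, list[str]]) -> set[str]:
    """Return the members that end up in the environment when syncing `start`.

    That is `start` itself plus everything reachable through declared
    dependencies; anything else in the monorepo is not importable.
    """
    seen: set[str] = set()
    stack = [start]
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        stack.extend(declared.get(name, []))
    return seen


plugin_view = visible_members("example-plugin", DECLARED)
ctl_view = visible_members("airflow-ctl", DECLARED)
print(sorted(plugin_view), sorted(ctl_view))
```

Because `airflow-ctl` declares no dependency on `airflow-core` in this model, `airflow-core` is absent from its view, which mirrors the point that an undeclared cross import simply fails.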
It helped us in our vision to actually, you know, decompose the project into multiple parts and avoid the classic problem of coupling, which every monorepo faces at some point in its lifecycle, because everything is right there, so, you know, code leaks all over the place. So this helps us prevent that. And I cannot imagine how we did it before uv. I don't know if we did it, but if we did, it would have been a really tough thing.
Yeah, there's a bunch of tools, linters and code analysis things, that you can run on your code that break it down into these different modules and layers. Here's a directed graph of how this thing fits together, and you can set up rules to say this should never cross that boundary, but these are just very, very vague things. And this setup actually makes it so it's not accessible to your code if you didn't say it should be.
It's just built into the definition of your distribution, which you have to write anyhow, because you have to define what the dependencies are. And yes, we did something like that before. We had a number of rough rules or whatever: don't import here, import there. We still have them for shared libraries, which we can talk about now, because I think this is an important modification of the concept.
So we do have some automated checks for quality and for imports with prek, our pre-commit hook implementation. But before that, it was just completely handwritten and unmaintainable. People were not actually updating it for all the distributions, and you couldn't really, you know, follow along when things changed. With the pyproject.toml for each distribution being the single source of truth, you don't have to do anything, because the dependency is declared there.
And this is the best part of uv understanding that and doing everything that is reasonable in this case. The other major tool involved here was prek, which is a pre-commit framework for running hooks in many languages, but especially Python, relevant here, and written in Rust. So it pairs well with uv, I suppose. Oh yeah. It was inspired by uv as well. And Joe mentioned that he was actually contributing to uv before. Great. How does prek show up here?
I feel like this is leading towards what you were hinting at earlier. It's a new name, prek. So, yep. This allows us to do a few things which pre-commit did not do, or, you know, did not accept as suggestions. So, one thing that prek offers is obviously that it's written in Rust, so speed is the obvious one that we get. But apart from that, we also get this notion of it pairing well with uv in terms of modularized hooks.
Earlier, we had all the hooks in one place, in the top-level pre-commit YAML, right? And it was a big file. It was really big. You can imagine. So, yeah. prek again, you know, adopted the concept of workspaces here, I would say. So it allows you to define pre-commit hooks, or prek hooks, within a module itself.
And this paired well with uv in the sense that when you have to run hooks that are bound to a certain distribution, all you have to do is cd into the, you know, the submodule and just do a prek run. It will run the relevant hooks for that particular module. And the other thing that I really love about prek is autocompletion, which is not something pre-commit had.
So you can imagine that something fails in the CI, and you have to copy the hook ID and try to backtrack it in your repo to find which one is failing. It used to be a nightmare, but now with, you know, tab completion, it's amazing. Nice. Are you talking about shell autocomplete integration? Yeah. Yeah. So, okay. I've seen. I have a story about that, a very short one.
So we actually tried to get autocompletion for hook names into pre-commit, which was the predecessor of prek. prek was largely based on pre-commit. But somehow the author didn't accept even the idea of us contributing it, or actually had some very, very excessive expectations for it. And we, you know, discussed it, and other people were also trying to convince the author to do that, but he refused.
He refused, basically, and refused to accept contributions. When we spoke to Joe, that was a completely different story. We said, we need that, and the next day it was there. It's a completely different approach. And then we said, we need workspaces, and a few weeks later, because it took a little bit of time, it was there, and we worked together and we tested it.
And I raised, I don't know how many issues on the initial pre-release version when we wanted to use it. So I think the collaboration, you know, working together, listening to your users, being responsive, and actually working together as open source maintainers, this worked perfectly well here, both in uv and prek.
And this is why we love prek, actually, because we know we can rely on it: if something is not working, we can discuss it and either submit a fix or, you know, Joe will do it, or even lots of other people can do it. There were a few features that we wanted and somebody else implemented them, and that wasn't Joe. They contributed to prek because of this openness and, you know, willingness to accept the needs of the users.
That was a very, very important part of why we moved to prek. Yeah. I think Airflow was also one of the initial case studies for prek. It's a project of that scale, and if you satisfy that project's needs, you're pretty good with most use cases. I think that's true of both prek and uv. Yeah. Right there at the top of the prek repo, it says, although prek is pretty new, it's already powering real projects, you know, little things like CPython, Apache Airflow, and FastAPI.
I know Hugo van Kemenade, the release manager of Python. We met at FOSDEM as well. And he was actually listening to our prek discussion, and he converted, you know, CPython to use prek because of the needs they had. So it was all about, you know, people talking to each other, word of mouth, and things like that.
You know, there's a feature listed here that just makes me jealous. One of the features of prek is a single binary with no dependencies that doesn't require Python or any other runtime to be installed. How incredible would it be if Python had a python --build app or something, you know what I mean? You could point it at your thing and get something you could distribute. I know uv solves a lot, but you've still got to have uv installed.
And, you know, that is a huge advantage of things like Rust and Go and some other languages. It's both good and bad in some cases. There are always trade-offs, and a different choice was made by Python here. I don't think it's the best choice for Python. I think with Python being a scripting language, it's okay to have, you know, dependencies, and inline script metadata especially almost gets you there, because you just, you know, can install stuff.
And uv, and that kind of tooling, is also doing all the stuff like uv install or uv tool install, whatever. And it will not only install the project and its dependencies, but also install the Python that is needed to run it. So all this is really a matter of tooling, and it has improved dramatically over the last few years. Yeah. I was pining for an option, not a binary-only thing.
All right. So one thing I actually want to talk about, going back to this workspaces thing real quick, is what does it look like from an IDE or editor experience to work on this? Like you've got PyCharm projects, you've got maybe VS Code workspaces where you can pull in different pieces. How do you all manage that? I cannot speak for VS Code. I'm a PyCharm user here, but we had to do a little bit of hacking, I would say, or more like a helper script for the IDEs, right?
So we have an IDE helper script right in the repo, and we recommend that users run it so that the IDE knows what is where, right? Because in normal projects, there's usually just one src and one tests at the top level, but here there are 120 plus. And the helper script does a pretty simple thing. It just auto-discovers all the packages in the monorepo and adds them.
So IntelliJ and PyCharm both have a .idea, a hidden folder within each of the projects that they open. And it supports an XML-like .iml format where you can define certain things. So this essentially does a very simple thing. For each package, it adds module/src as the source root and module/tests as the test root. It's as if you went through all 120 things and right-clicked and said mark as sources root or something like that.
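The discovery half of such a helper script might look something like the sketch below. This is not Airflow's actual script: the `discover_roots` function, the `src` and `tests` layout, and the throwaway demo tree are assumptions for illustration, and writing the result into the IDE's `.iml` file is left out entirely. A real script would also want to skip directories such as virtual environments.

```python
import tempfile
from pathlib import Path


def discover_roots(repo_root: Path) -> dict[str, list[Path]]:
    """Find every directory with a pyproject.toml and collect its src/tests roots."""
    sources: list[Path] = []
    tests: list[Path] = []
    for pyproject in sorted(repo_root.rglob("pyproject.toml")):
        package_dir = pyproject.parent
        if (package_dir / "src").is_dir():
            sources.append(package_dir / "src")
        if (package_dir / "tests").is_dir():
            tests.append(package_dir / "tests")
    return {"sources": sources, "tests": tests}


def _demo() -> tuple[list[str], list[str]]:
    """Build a tiny throwaway monorepo layout and run the discovery on it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for name in ("airflow-core", "airflow-ctl"):
            (root / name / "src").mkdir(parents=True)
            (root / name / "tests").mkdir()
            (root / name / "pyproject.toml").write_text("[project]\nname = 'x'\n")
        roots = discover_roots(root)
        return (
            [p.relative_to(root).as_posix() for p in roots["sources"]],
            [p.relative_to(root).as_posix() for p in roots["tests"]],
        )


demo_sources, demo_tests = _demo()
print(demo_sources, demo_tests)
```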
Yeah, we had this PyCharm script, and then we have the same approach for VS Code. So we have another script for VS Code as well, which was contributed by someone who uses VS Code, because neither Amag nor I are VS Code users. Both of us use PyCharm. But, you know, it's a community, and somebody said, OK, I'll do it. And there it was. And they tested it. And, you know, that was super cool, actually. So, yeah, it works well.
Also, you know, we probably won't talk too much about it, because we don't have too much time, but maybe it's the right time to introduce the shared libraries concept a little bit. Because one thing that Amag mentioned is that we solved this coupling problem, but we also wanted to solve the DRY problem.
And those two are always kind of in tension. You get more DRY and then you get more coupling, or less DRY and less coupling, and all these things are complex when you have lots of code interacting with each other. DRY being the architectural philosophy of don't repeat yourself. But if you're not repeating yourself, then if it exists somewhere, everything's got to depend on that somewhere, and it starts to become more linked together. Right.
So it's a little bit of eat your cake and have it too. We want to have DRY code and not repeat it for common utilities, like logging, configuration, whatever, all the things that are common between all the different distributions. But we also didn't want to depend on a single version of those, because if we do, then it means that we have to make sure that backwards compatibility is maintained.
Because when we install different versions of different distributions coming from different points in time of the repository, they might use different versions of those shared libraries. And how do you make sure that they don't have breaking changes and so on? So this is a whole level of complexity around how to manage the dependencies there and manage versions, and especially manage backwards compatibility.
So we figured this out with a very simple approach. We tried a few different approaches, and one of them was using the vendoring library from pip. And the second one, the one we finally implemented, was using symlinks to share the code between different distributions. And that's a very innovative approach that I hope will make it into some kind of standard eventually.
So we came up with this approach where we actually have our cake and eat it too, which is pretty amazing if you have fought for years with this kind of common dependency and backwards compatibility issues. In our case, the symlink approach needs some pre-processing of pyproject.toml. Some parts of the pyproject.toml are generated to make it actually work. But this is all automated with prek, so we don't even have to think about it.
And once we do that, and once we create some symlinks between different parts of the code, so that one distribution symlinks in code from the shared distribution, the end result is that this code gets automatically vendored in during the building of the package, which means that we actually have the same library in different versions in different distributions. So a distribution released a week ago will have the shared configuration code from a week ago.
But another distribution will have the same shared configuration code from today if it's released today. And we can install them together. And effectively it's as if each of them had a different version of the same library installed. It's as if airflow-ctl said it had a dependency on core and pinned that version to something, but a different part of the repo pinned it to a different one. And they can both kind of coexist. But it's actually all within the same code file. That's insane.
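The effect of symlinking shared code and vendoring it at build time can be imitated with the standard library alone. The sketch below is only an analogy, not Airflow's build setup: the directory names, the `example_config` module, and the `build_snapshot` helper are invented, the "build" is a plain `shutil.copytree` that copies symlink targets instead of the links, and it assumes a platform where creating symlinks is permitted.

```python
import os
import shutil
import tempfile
from pathlib import Path


def build_snapshot(package_dir: Path, out_dir: Path) -> Path:
    """Copy a package for release, replacing symlinks with the code they point at."""
    target = out_dir / package_dir.name
    # symlinks=False makes copytree copy the linked content, not the link itself.
    shutil.copytree(package_dir, target, symlinks=False)
    return target


def demo() -> tuple[str, str]:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)

        # The shared library lives in exactly one place in the repository.
        shared = root / "shared" / "example_config"
        shared.mkdir(parents=True)
        (shared / "__init__.py").write_text("VERSION = 'last week'\n")

        # A consuming distribution symlinks the shared code into its own tree.
        consumer_shared = root / "consumer" / "_shared"
        consumer_shared.mkdir(parents=True)
        os.symlink(shared, consumer_shared / "example_config", target_is_directory=True)

        # "Release" the consumer, then keep developing the shared library.
        built = build_snapshot(root / "consumer", root / "dist")
        (shared / "__init__.py").write_text("VERSION = 'today'\n")

        vendored = (built / "_shared" / "example_config" / "__init__.py").read_text()
        live = (consumer_shared / "example_config" / "__init__.py").read_text()
        return vendored.strip(), live.strip()


vendored_code, live_code = demo()
print(vendored_code, "|", live_code)
```

The released copy keeps the code as it was at build time, while the working tree follows the symlink to the current code, which is the "same library, different versions in different distributions" outcome described above.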
OK. And it's nothing new. It's largely inspired by how libraries work in C and traditional compiled code. You have dynamic libraries and static libraries. So this is essentially the equivalent of static libraries, where you take the code of the version that you compile against and put it inside the final binary. And then you get, like in Rust, the single binary thing.
So we have a little bit of this single binary by doing that, in the sense that we automatically vendor in all the, you know, shared dependencies that we have into the same distribution. So it's kind of a hybrid. Rust goes a little bit too far, because everything is a single binary. In our case, we have a bit of both. We can use libraries dynamically, but we can also embed shared libraries inside a single distribution. That's very cool.
That's wild. Amag? Sounds like you were instrumental in this part. That's the nice thing about the approach that was chosen, right? We all came together as a community on this one. We had one email discussion on the dev list one fine day saying, hey, we want to achieve something like this, which more or less was something everyone agreed upon. So people started chiming in and we started trying different things out. The first one, obviously, was using the vendoring tool from pip.
Somebody did a POC on that, but it felt like it was going to be difficult to maintain long term. And also it could be brittle. So Yarek came up with this particular option with symlinks, which again was discussed within the community. A few of us picked this PR up, tested it locally, played around, and gave feedback. So I don't think this would have been possible with AI, in the sense that this has never been done before.
Something like this, where a community comes together and solves a rather difficult problem, is something that makes me really happy. And also that all of us are working towards a common goal while also bound by our corporate hats, right? That is again something really nice to see.
I think at this point we have about 11 to 12 shared libraries, where the main notion is to reimagine Airflow as an independent server, more like a control plane and an execution plane. That's what we did with Airflow 3, and these shared libraries are helping us achieve that model. And I think a few more are coming very soon. But yeah, it's been nice working on the shared libraries.
Is this something that people can take and adopt into their monorepo if they want to live that life? Absolutely. Yeah. It's really just one or two prek hooks which are maintaining the consistency, so that you don't forget to add this symlink here and that pyproject.toml definition there, or that hatch definition for hatchling to actually embed your symlinked code into the final distribution.
So there are a few pieces that have to be put together from existing libraries. That's basically it. And once you do it, the funny thing is that those shared libraries are just standalone distributions. You can actually build them separately as a library as well. We could potentially even, you know, just use them as a library. No problem whatsoever, because they are just standard plain distributions like any other.
We just happen to take the source code of it and embed it into the target distribution that wants to use it rather than, you know, linking to it as a dependency. Other than that, it's a completely standard library, or standard distribution. And one more thing that is really important to add here is that this also has a side effect, but I think a very nice one.
And Amag can confirm that, because he has been doing a lot of it: we actually came up with a way better internal architecture because of that. A lot of those shared libraries depended on each other, sometimes in a circular fashion. Sometimes it really depended on which import you did first, what happened, what was initialized. And it was complete spaghetti of dependencies between generally independent pieces of functionality.
Right now, by having shared libraries, we are actually forcing ourselves to make them isolated. We are changing the way we initialize them. For example, we are injecting all the configuration rather than reading it from inside the library, because configuration is another library, and you don't want to depend on the other libraries. So it's really nice.
The result is that the architecture of Airflow internally is really so much better because of that. So fewer surprises, and explicit initialization is something that we have to do, rather than implicit initialization during imports, which has always been plaguing us as a big issue.
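A small sketch of the difference between import-time initialization and injected configuration follows. None of these names come from Airflow; `LoggingSettings`, `SharedLogger`, and `main` are hypothetical and only show the pattern of a shared library that receives its configuration from the component's entry point instead of importing a configuration library itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoggingSettings:
    """Plain settings object handed to the shared library by its caller."""

    level: str
    prefix: str


class SharedLogger:
    """A shared utility that does no work at import time.

    It never imports a configuration module; everything it needs is
    injected through the constructor.
    """

    def __init__(self, settings: LoggingSettings) -> None:
        self._settings = settings

    def format(self, message: str) -> str:
        return f"[{self._settings.level}] {self._settings.prefix}: {message}"


def main() -> str:
    """Explicit entry point: build the settings, then wire up what needs them."""
    settings = LoggingSettings(level="INFO", prefix="example-component")
    logger = SharedLogger(settings)
    return logger.format("started")


result = main()
print(result)
```

Because nothing happens until `main` runs, the order in which modules are imported no longer affects what gets initialized.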
Certainly, it also allows you to imagine each component having an entry point, per se, where you have an initial starting point and it initializes everything it needs by injecting and calling certain factories, which makes it very clean for anyone visiting the project. They look at something and they know the entry point very clearly: hey, this is how it starts, this is what it initializes.
You know, it reminds me of Golang or Java projects, where they have a nice main, whereas in Python it's not really the same way. All right. Well, I think that's about all the time we have. I guess let's close it out with one final thought. For people who are maybe inspired by your design, by the way you put together Airflow and this monorepo concept, especially Python people, what do you say to them? Final thoughts here.
I mean, there was always discussion. We had lots of discussions internally, even with some of the team members in Airflow. They said, let's split the repository into smaller ones, let's make more of them, because it's going to make things easier. I was always the monorepo fan, and I did a lot of work to make it possible. But that was a very, very difficult thing. It's changed. The reasons why you would want to have multiple repos are gone now if you're using the right tooling.
And only the benefits, or mostly the benefits, of having it in one place, where you can test everything together and work on it together, remain. All the rest is basically gone. So for me, the monorepo versus multirepo discussion is already settled. Yeah, just do it. Personally, I've been using the README that we have in the shared libraries as context for my AI. It's turning out to be very nice for the shared library split, for example.
All I have to do is just provide it the context and tell it, hey, just construct the structure for me, and I can do everything else. So it's that easy. We have all the things in place. We are in the right era to do it. So just do it. Very inspiring. Thank you for being here. Awesome to get this look inside. And it's Apache Airflow. It's on GitHub. People can go look and see. It's not just talking vaguely about some internal project. Right. So people can go check it out. All right.
See you later. Thanks. Thanks. This has been another episode of Talk Python To Me. Thank you to our sponsors. Be sure to check out what they're offering. It really helps support the show. This episode is brought to you by our Agentic AI Programming for Python course. Learn to work with AI that actually understands your code base and build real features. Visit talkpython.fm/agentic-ai.
If you or your team needs to learn Python, we have over 270 hours of beginner and advanced courses on topics ranging from complete beginners to async code, Flask, Django, HTMX, and even LLMs. Best of all, there's no subscription in sight. Browse the catalog at talkpython.fm. And if you're not already subscribed to the show on your favorite podcast player, what are you waiting for? Just search for Python in your podcast player. We should be right at the top.
If you enjoy that geeky rap song, you can download the full track. The link is actually in your podcast player's show notes. This is your host, Michael Kennedy. Thank you so much for listening. I really appreciate it. I'll see you next time. Bye. Thank you.
