The AI-First Data Engineer: 10–50x Productivity and What Changes Next

Apr 07, 2026 · 59 min · Ep. 508

Episode description

Summary 
In this episode, I sit down with Gleb Mezhanskiy, CEO and co-founder of Datafold, to explore how agentic AI is reshaping data engineering. We unpack the leap from chat-assisted coding to truly agentic workflows where AI not only writes SQL and dbt models but also executes queries, debugs, runs tests, and ships production-ready outcomes. Gleb explains why teams that master this AI-first loop can see 10–50x gains, how security/compliance concerns can be addressed with platform-native LLM endpoints, and why the role of data engineers is shifting from code authors to operators of autonomous agents. We dig into the consolidation of the modern data stack, the economics driving more data products (Jevons paradox), and why product thinking, domain knowledge, and cross-functional skills will define the next wave of standout data professionals. We also cover practical steps for leaders and ICs: modernizing off legacy platforms, establishing safe AI adoption paths, codifying reusable “skills” and context for agents, and building validation utilities that keep the inner loop fast and trustworthy. Finally, Gleb shares how Datafold moved to fully AI-driven software delivery and why “outcomes over tools” is the emerging model for complex initiatives like data platform migrations—and how this reframes data quality for the AI era, emphasizing broad data access plus rich context over brittle human-centric tests. 
Announcements 
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • If you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks', and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest: we all need to Retool how we handle data requests.
  • Your host is Tobias Macey and today I'm bringing back Gleb Mezhanskiy to talk about our predictions for the impact of AI on data engineering for 2026

Interview
  • Introduction
  • How did you get involved in the area of data management?
  • What are the concrete steps that teams need to be taking today to take advantage of agentic AI capabilities?
  • What are the new guardrails/constraints/workflows that need to be in place before you let AI loose on your data systems?
  • How do you balance the potential cost savings and productivity increases with the up-front investment and variability in inference spend?

Parting Question
 
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
 
  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Transcript

Tobias Macey

Hello, and welcome to the Data Engineering Podcast, the show about modern data management. If you lead a data team, you know this pain. Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work.

Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data while keeping it all secure. Type a prompt like build me a self-service reporting tool that lets teams query customer metrics from Databricks, and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos.

Check out Retool at dataengineeringpodcast.com/retool today, that's r e t o o l, and see how other data teams are scaling self-service. Because let's be honest, we all need to retool how we handle data requests. Your host is Tobias Macey, and today I'm welcoming back Gleb Mezhanskiy to talk about predictions for the impact of AI on data engineering through the remainder of 2026.

And so, Gleb, for folks who haven't heard any of your past appearances, if you could just give a quick introduction.

Gleb Mezhanskiy

Yeah. Thanks for hosting me again, Tobias. Always great to be back. Yeah, I'm Gleb. I am CEO and co-founder of Datafold. I spent pretty much my entire career being a data engineer and building data platforms. I got a chance to build a data platform for Autodesk using the data cloud tools that were new at the time. That was over ten years ago. And I spent a few years scaling data platforms at Lyft, which was a really great and insightful time because Lyft is an incredibly data-driven business.

And then I built the third data platform at a startup building teleoperation for autonomous vehicles and had to deal with a lot of IoT data and telemetry, which was also really cool. And finally I started Datafold, which I've actually been working on for, I now realize, six years already this March.

At Datafold, we have always worked on automating data engineering. Pre-AI, we focused a lot on automating different data quality workflows. And now with AI, we focus on AI automation, which includes providing specialized agents for certain things like automating data platform migrations and optimization of data platform costs. And we also provide tools for anyone's agents to be more efficient with context and various specialized tooling.

Tobias Macey

And so you recently put up a blog post on your company blog talking about some of the predictions that you have for the data engineering ecosystem now that we do have AI that is capable of performing a lot of operational and development tasks. And so that definitely changes the scope and skills of what people who are actually working in these spaces need to be thinking about.

And so I'm just wondering if you can just quickly run through some of the key points, and then we'll explore some of the impact that that has on how people should be thinking about structuring their day to day work and their systems to be able to actually take advantage of some of these shifts in the industry.

Gleb Mezhanskiy

Yeah, absolutely. And the title of the post saying predictions is not particularly modest, but I can talk about how it came together. So AI has been kind of in our life for what, let's say three years at least. And we've all been using it in different forms, starting from ChatGPT obviously, and kind of trying to build things and automations and using coding tools. And I think it all has been kind of coming and evolving.

And then probably around November, I remember I had a kind of an awakening myself, after the release of Opus 4.5 by Anthropic and then OpenAI releasing their own version of a similarly capable model, if I remember that order correctly. I think there's been a very sharp increase that I felt personally in my workflows in terms of coding, both software and data engineering coding. And that was super profound for me.

Over the Christmas holidays, I was able to automatically code (I wouldn't say vibe code, because I think that's a little bit understating the impact) two new products for Datafold. And then I also tried what it's like to do data engineering with a full agentic experience, and I was completely blown away. I always thought of myself as someone who can write really good SQL, and in my day of data engineering, I think that was a superpower, and that's how you advanced in your career and that's how you got things done. And with the current capabilities of agentic coding, I just thought that this completely changes the job and the experience.

And I thought it's really important to reflect on what it means for data engineering. And at that time I didn't really have very concrete thoughts. So I thought, well, I would just try to put them on paper and make them a little bit more structured. And yeah, I can kind of go through the biggest things that at least for me were important takeaways. So I think one is that right now it feels like the world is divided, like in Harry Potter, into wizards and muggles.

And I was myself a muggle until I went full into agentic coding. And I think it's important to define what agentic coding actually is, because there are so many different terms and buzzwords thrown around. So if you ask anyone, Hey, how are you using AI in your work? Everyone would say that they use it in some capacity. But if you actually drill in and say, Well, how are you actually using it in your workflow? Let's say a data engineer or an analytics engineer. A lot of people would say, well, I use ChatGPT or I use Claude or another LLM to help me write code.

And then if you ask them, walk me through what you actually are doing, they would say, well, say I need to do analysis. I would prompt the chat to write me some code, and then I would run this code against my data platform, my database, and then get results and then maybe ask the chat again for suggestions. And this is actually helpful because writing SQL is tedious, but it's not agentic coding.

And the way that I would define agentic is that an agent is an AI system that is capable of achieving a goal by choosing the path for how to get there with the tools that you give it and the context that you give it. And the difference is that if you use an agent to accomplish a similar task, let's say building a new model in a dbt workflow, an agent would actually be able to not only write the code for you, but execute that code against the database, get the results, evaluate the results, put the code into, let's say, a dbt model, run dbt, debug it, write tests, debug it, and then present you with the complete outcome.

And the difference with that kind of workflow, where an agent not only writes something for you but executes actions (in the context of data engineering, that means executing something in the database), is enormous. I would say 10 to 50x, relative not to manual work, but relative to using just a chat experience or a tab autocomplete experience.
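The loop described here (write code, execute it against the database, evaluate the result, debug, validate, present the outcome) can be sketched in a few lines. This is only an illustrative sketch, not any vendor's agent: `propose_sql` is a hypothetical stand-in for an LLM call, the `rides` table and its columns are made up, and an in-memory SQLite database stands in for a real data platform.

```python
import sqlite3


def propose_sql(goal, feedback):
    """Hypothetical stand-in for an LLM call that drafts a query.

    The first draft deliberately references a column that does not exist,
    so the loop has an execution error to recover from.
    """
    if feedback is None:
        return "SELECT city, SUM(fare_usd) AS revenue FROM rides GROUP BY city"
    return "SELECT city, SUM(fare) AS revenue FROM rides GROUP BY city"


def validate(rows):
    """A simple check the agent runs on its own output; returns a problem or None."""
    if not rows:
        return "query returned no rows"
    if any(revenue is None or revenue < 0 for _, revenue in rows):
        return "revenue must be non-negative"
    return None


def run_agent(conn, goal, max_steps=5):
    """Write -> execute -> evaluate -> retry, until validation passes."""
    feedback = None
    for step in range(1, max_steps + 1):
        sql = propose_sql(goal, feedback)
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            # Feed the database error back into the next drafting step.
            feedback = f"execution error: {exc}"
            continue
        feedback = validate(rows)
        if feedback is None:
            return {"sql": sql, "rows": sorted(rows), "steps": step}
    raise RuntimeError(f"gave up after {max_steps} steps: {feedback}")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rides (city TEXT, fare REAL)")
conn.executemany(
    "INSERT INTO rides VALUES (?, ?)",
    [("SF", 10.0), ("SF", 5.0), ("NYC", 7.5)],
)
result = run_agent(conn, "revenue per city")
print(result)
```

The point of the sketch is that the human only supplies the goal and the validation rule; the execute-and-retry steps in between happen without a person copying queries between a chat window and a SQL console.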

And the reason is that it's a loop where not just a single task gets automated but everything gets automated. And in that environment, the human stops being the bottleneck of having to execute things and evaluate things. We're just now becoming the driver of a really autonomous process. And this is something that has been going on for at least over a year. So Claude Code was released, I think, about a year ago, but it took some time, both in terms of evolving the agents and evolving the models that power them, to get to the point where this process can be truly autonomous.

And so I think that that's the first prediction, which I preface by saying it's obvious, because I think to many that tried it, it's very obvious. Agentic data engineering will boom in 2026 and will just be the default mode of how most data teams are operating. But the reason why I think it's still important to talk about it is because, back to the muggles and wizards, so many people haven't tried it. And even at Datafold, we had some of our most impactful engineers adopting this workflow months after, for example, I had adopted it as a CEO. And it was really hard to convince some folks because they would be like, well, I really like my workflow. I need to see all the code. I need to write all the code. And they had perfectly reasonable explanations for why that happens. But once you get into trying agentic coding, I think there's no going back. You're never the same person again.

And so I think that's why it's important to talk about it: I still think that the majority of data practitioners out there, especially at big companies, especially in enterprise that's a bit more regulated, a bit more conservative, are just not yet in this mode of working. And I think that we need to change that very fast.

Tobias Macey

Yeah. That's definitely one that's worth spending some time on because even just in the software engineering space, putting aside the differences between just writing the software and working in the data engineering space, which is a topic that I have covered on this podcast I don't even know how many times, but just even talking to software engineers about the use of agentic coding tools, there's a huge degree of variance in terms of their adoption and their willingness to cede that much control to the agents. And I'm speaking as somebody who is leading a team that is going through some of those growing pains, and I'm definitely on the side of being very enthusiastic about it and trying to push people into it.

So you still have those people who are saying, no. You're going to pry my IDE out of my cold dead hands. I'm not going to let an AI write the code for me. And I think that there's definitely a spectrum, and I think that a lot of people are trying to fight what seems to be inevitable. I mean, obviously, time will tell.

I'm interested too in maybe digging into some of the distinctions of using agentic software engineering and getting the AI to write the code and run the validation suite and execute the tests and create that feedback loop, and some of the additional ways that we need to think about extending that workflow, or integration points or various tools or MCPs or context layers that we need to incorporate, to be able to bring that level of functionality into the data engineering space, where it's the code working on the actual data that determines its effectiveness and correctness, and just some of the ways that that is maybe underserved by the current suite of tools and focus from the very active and constantly shifting landscape that we're currently trying to navigate.

Gleb Mezhanskiy

Yeah. Tobias, you actually raised a very good point about the resistance being kind of about giving up control. As someone that has always been trying to get to the solution as fast as possible, I don't necessarily have that problem, but I do see folks that are way more technically capable and have much deeper knowledge than I do having that friction more, because I think they just honed their craft way more than I have when it comes to writing code.

But I think that what's important to understand and remember is that, at least right now, as a human you're still in control of the process, right? So however you want to do review, whether you want to review every single line of code that AI writes or define the tests or define the QA process, you can do that, right? It doesn't mean that you don't have control of the outcome. It's just that the whole process can be that much faster if you leverage agentic coding.

And then the second point that you brought up is how does agentic data engineering differ from the same pattern in software? And you're right, I think the access to data is incredibly important because if I'm a software engineer, I typically develop in a sandbox environment with synthetic data or with data that I come up with just for executing tests. I don't really code live on the system. Whereas if I'm a data engineer, it's usually the opposite. I am writing code, and all the preliminary exploratory queries I'm doing, I'm doing on production data because that's how I get the insight in terms of what the data contains.

And that obviously presents challenges for AI adoption because a lot of enterprises are rightfully hesitant and anxious about letting AI, which means large language models that can be hosted by different providers, access their proprietary data. Because if you look at the terms of service of different LLM providers, there's actually quite a bit of a range in their guarantees in terms of whether or not data and prompts are used for post-training and evaluation. And so if we have a coding agent that leverages a third party LLM and it has access to your database, that actually does present quite a bit of security surface area and privacy surface area that you need to be aware of.

Now, I don't think this is a good excuse not to use AI because by now, as of today, there are multiple ways to ensure that your agentic data engineering coding is perfectly in line with even the strictest possible compliance requirements. For example, I think that all data teams, even in the most regulated industries, use a cloud data platform as their core center of operation, whether it's Databricks or Snowflake or GCP.

And each one of those platforms offers their own LLM endpoints that are governed by the same terms of service as the rest of the platform. And you can use those LLM endpoints for agentic coding. You can even use your favorite agents like Claude Code with the LLM endpoints that are hosted within Databricks or Snowflake for coding. And that means that none of the data leaves your security perimeter. So the LLM and the data are all within the same security perimeter from the data flow perspective, but also from the legal perspective.

And furthermore, we've seen data platforms like Snowflake and Databricks very aggressively roll out their own agents that are even more, I would say, out of the box ready and security compliant because they just by definition work within the same environment and are not using data for training something that is completely outside the environment, as far as I know, but obviously ask your lawyer. So I think that the maturity of those solutions has evolved enough for enterprise to be able to adopt those tools pretty aggressively. And I think that the adoption is lagging way behind the capabilities right now.

Tobias Macey

Now digging into some of the second and third order impacts of using these agentic workflows for data engineering, there's also the differentiation that we need to think about of not just am I using the AI to do the work of data engineering as far as writing the code and validating it, but what is the role once I move that agent off of my laptop and turn it into an always-on mode and give it that goal-oriented execution to say, you're actually going to live within the execution path of my data engineering, whether it is operational monitoring to do validation of data as it lands, or doing in-flight data transformation to do things like adding structure to unstructured data, doing things like entity extraction, and some of the ways that that shifts even just the nature of the work beyond just I'm gonna write a bunch of SQL, write a bunch of transforms, and then verify everything is correct at the end of the day.

Gleb Mezhanskiy

Yeah, I think in terms of the implications on the data engineer's job, they are quite profound. And I think that the value of a data engineer as it has been for the past ten, fifteen years, as we've seen the rise of cloud data warehouses and big data, in terms of writing the code and maintaining the code for data pipelines, is now shifting towards operating, like you said, agents or teams of agents that are performing different tasks that previously would be completely owned by human engineers.

And I think that that's a very important shift that data engineers and data analysts and analytics engineers need to recognize and be aware of, because if your current role is writing code and that's your current value prop, that will probably no longer be relevant over the course of this year. And it's just a matter of time. And I don't think the timeline is very long until that type of skill will be completely eliminated by automation.

But I don't think that means that we don't need data engineers or we don't need that many data professionals, because the agentic data engineering patterns drop the cost of creating data pipelines, managing data pipelines, operationalizing them. And that means that the business can do much, much more with their data. So it doesn't mean that like, Oh, okay, we will just do the same but with fewer people. I think that when I talk to, I would say, more forward looking data leaders and CDOs at enterprise, I hear them being very excited about the new capabilities that previously they just weren't able to unlock because their teams were completely bogged down doing the basics.

For years, over a decade, all we talked about was how to deliver dashboards and machine learning models and maintain SLAs and data quality for stakeholders. And now because those things can be automated, the types of things that I see data leaders wanting to tackle are incredibly exciting. For example, I've been chatting with one of our customers who runs the data platform for a very large parcel delivery service. And they've been talking about how after adopting a modern cloud warehouse and also embracing AI, they're now thinking, okay, we can go way beyond dashboards. We can actually create a simulation for the business so we can simulate every single parcel, we can simulate the bottlenecks. And then instead of being reactive with operational dashboards, being proactive. So having solvers that just run our business based on the data.

And so I think this means that the demand for data engineering as a way to deliver high quality data to power data-driven decisions is going to actually grow. And in economics, there is this famous Jevons paradox that essentially says that if the price for a given resource or capability drops, we'll actually see more of that being consumed. And you see a lot of talk about Jevons paradox in the context of GPUs and AI and how, with the cost of AI dropping, people will be using more AI. I think the same is true for the output of data engineering. Because it's going to be cheaper to create data pipelines, we'll see more data pipelines being created, more data products being created. Because I think historically data has been underutilized by businesses in terms of what's possible to do to run businesses more efficiently, and the economics will just create really strong motivation to do more.
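One common way to make the Jevons argument concrete is a constant-elasticity demand curve: when demand is elastic enough (elasticity above 1), a drop in unit cost raises not just the quantity consumed but the total spend. The sketch below is a toy illustration only; the baseline cost, baseline quantity, and elasticity values are assumptions chosen for the example, not figures from the episode.

```python
def pipelines_demanded(cost, elasticity, baseline_cost=100.0, baseline_quantity=10.0):
    """Constant-elasticity demand: quantity scales with (cost / baseline_cost) ** -elasticity.

    All parameters are illustrative assumptions, not measured values.
    """
    return baseline_quantity * (cost / baseline_cost) ** (-elasticity)


def total_spend(cost, elasticity):
    """Total outlay on pipelines at a given unit cost."""
    return cost * pipelines_demanded(cost, elasticity)


# Compare demand and spend before and after a 10x drop in cost per pipeline.
for elasticity in (0.5, 1.0, 1.5):
    before = total_spend(100.0, elasticity)
    after = total_spend(10.0, elasticity)
    quantity_ratio = pipelines_demanded(10.0, elasticity) / pipelines_demanded(100.0, elasticity)
    print(
        f"elasticity={elasticity}: "
        f"{quantity_ratio:.1f}x more pipelines, "
        f"spend {before:.0f} -> {after:.0f}"
    )
```

Under these assumptions, more pipelines get built at every elasticity, but total investment in data engineering only grows in the elastic case, which is the scenario the argument above relies on.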

Tobias Macey

Yeah. You're seeing the references to Jevons paradox a lot in the software space as well, with people having that debate over whether software engineers are going to be completely obviated and removed as a result of AI, but instead, you're just seeing software engineers doing more. I know in my own work, there are dozens of different little scripts or tweaks or improvements or side projects that I've done in my day job that I wouldn't have otherwise bothered with. I actually just recently had a project that's been waiting on the shelf for about two years that I did a full write-up on: this is exactly how I would do it if I had the time. It's probably gonna take a full-time engineer about two months to actually do the whole thing and validate it. And then just last week, for whatever reason, that project came back to my mind, and I said, oh, well, I'm just gonna go ahead and throw the document at my agentic engineering tool. And within two days, I had it complete and validated, and now it's in production.

Yep. And so I think one of the interesting side effects as well of the acceleration of capability and productivity, but also the broadening of who can do which parts of the workflow, brings an interesting question about the role proliferation that the data space in particular has been seeing, I think, even just since 2020, where with the idea of analytics engineers and machine learning engineers and MLOps and LLMOps, data engineers and pipeline engineers and SQL engineers, you've been seeing this fragmentation of specialization because the work is complicated. It does require a lot of domain and technical expertise to be able to do effectively.

I'm curious what impact you are either seeing or predicting on just the ways that we think about what the roles and responsibilities are for data oriented professionals and whether we will maybe see a coalescing of role definitions because every person can have a broader scope because of the fact that they're able to get the AI to take on a lot of the heavy lifting.

Gleb Mezhanskiy

Yeah. Tobias, I think this is an excellent insight, and I do think that we will see something similar to what we have been seeing in the software world, where there has been consolidation and at the same time, all of a sudden, the product engineer and product manager, so roles that have been more focused on tying the business problem solving to the actual technical solution, have been elevated massively, because now if you're a product manager, if you're a product engineer, you don't have to rely on a team of more specialized engineers to get what you need to get done.

And I think that just like for software engineers, where having more of a product mindset, wearing more hats, being able to think more strategically, and interfacing with people now matter more, the same will be true for the data space as well. I think that there will be less value in being hyper specialized. For example, in my day of data engineering, we've had streaming data engineering experts who would just work on streaming ingestion pipelines. And then we would have analytics engineers just work on turning the data already ingested into data products that are consumed by data analysts and data scientists. I think that we will see far more demand for cross functional specialists who can take a business problem and then solve it end to end from the very, very beginning, which could actually start in, Oh, we need new instrumentation and we need to bring in new data streams, all the way to, okay, how this now powers the business through either humans making decisions or, increasingly so, probably machines making decisions about the business.

And I think that has a really important implication on, again, how data professionals need to think about their career evolution. I think that the soft skills, the business acumen, the domain expertise will start to matter way more than highly specialized technology skills, and product thinking as well. Because ultimately, the term data product has been kind of in vogue in the past few years, but it never really truly picked up. I think now it's actually worth revisiting because everything we do as data practitioners, every single streaming pipeline or, you know, machine learning model, it all is in the service of solving a business problem. So it all is some sort of an internal or external product that we're building.

Tobias Macey

And the natural next question is if the nature of the work and the people who are doing the work collapses down to a smaller number who are doing more, how does that also impact the way that we think about the underlying platforms and infrastructure that we need, where 2020, maybe starting in 2019, really saw the growth of the whole modern data stack that caused huge proliferation, also because of the fact that we had zero interest rates. So VCs were throwing money at everybody with an idea, and now we're in another phase where a lot of those early movers are getting acquired or put out of business because their adjacencies are being consumed by other systems. I think maybe one of the best examples is the work that's happening with Fivetran and dbt and SQLMesh, where they were all separate tools, they were all separate companies, and now they have all been aggregated into one company that is trying to own more of the process.

And as the actual entities that are interacting with all these technical layers increasingly cease to be human and are instead AI and agentic workloads, how does that change the requirements of what the systems need to be able to do, what the integration patterns are, what the surface area of that technology stack needs to look like to enable these agentic workloads to be able to execute more effectively and with the appropriate context?

Gleb Mezhanskiy

Yeah. Well, there's so much to unpack here, Tobias. But maybe, like, yeah, let's start with the consolidation. I think the consolidation of the data stack is part of a more global phenomenon. I think that the fragmentation has historically, like you said, been caused by a lot of the available funding. I think the available funding is one of the causes, but I think the other is that historically writing software has been very expensive across the board, not just engineering, but also good product managers have always been very rare and expensive and hard to find and hard to nurture and grow. And then each product team needs to be staffed with great software teams to be able to ship features.

And so for any single vendor to be able to, let's say, go very deep in a problem or expand into an adjacent product area, it always has been quite expensive and a risky bet at many times because you have to invest a lot, and this hasn't been your focus area. Do you go there? How fruitful is it going to be? And so that's why many software vendors have been very focused on just their domains and core competencies. And that's why also we've seen a lot of startups pop up that have been solving things that were falling through the cracks among the larger vendors or focusing on niche problems.

And now, because the cost of writing software is drastically, drastically cheaper, you can experiment way faster, you can ship MVPs and test them way faster, and you can expand into adjacent product areas that create more value for your target user persona way quicker. And that's far less risky because of the whole compression of the shipping cycle and costs. Just for example, with Datafold, we were able to expand into areas such as data platform cost optimization very quickly despite being a very small team, and, earlier, into data platform migrations; that would not be possible at all without AI. And so I think that's kind of the expansion of the platforms and consolidation of the more fragmented market. That's definitely one force. I think the other force that we touched on is do we need that many people building data products, building pipelines?

data products, building pipelines? We're seeing a lot of layoffs happening at companies that are actually seemingly doing quite well. And so there's definitely a lot of anxiety around, well, do we need the main tech professionals? Do we need that many people in data space? And I don't think we know for sure. I don't think anyone knows for sure, at least not until It's hard to say that yes,

will need far, far fewer humans to just run everything Because at the same time, we see that companies that are doing really well and growing really fast, like the big AI labs and a lot of players in the AI space, they are hiring very aggressively.

And that's even though they have a lot of AI automation and arguably they're best in class in being able to ship and automate with AI. So I think that maybe a few things are true at once. I think that if what you possess as a data professional is what I would call a commodity skill, like writing SQL, which is just

no longer defensible or remarkable, you are at risk. But at the same time, if you're able to stretch and combine multiple roles and you bring product thinking and you're great at working with people and navigating and understanding business environments, business context and business goals, then I think you will be in demand as much as ever, because you can do so much; your impact can be 50x

and companies would value that. And I think that means that just the market becomes far less even or kind of uniformly distributed. Now we're probably gonna see a bimodal distribution of data professionals who are adopting very quickly and are very marketable

because they possess skills that are valued in the AI world, and then a lot of folks who are kind of lagging behind because their skills are no longer in demand. And I think that's what makes this whole labor market very turbulent for data professionals. And that's why I think it's very important to be in the first camp and not the second camp.

Tobias Macey

Expanding a little bit on what you were referring to earlier as well: because we can move faster, because we can experiment faster,

I am no longer willing to settle for, hey, give me a static dashboard that I can look at when I remember to and instead moving to these more proactive use cases. And I think this is the overall dream of what we wanted business intelligence to be of, okay. Great. You've told me something. Now what? And the now what, I think, is more in the loop and more automatic and more exploratory.

And I'm wondering what are some of the ways that you are seeing some of the potential for these agentic use cases and asynchronous discovery that can happen, even looking at things like the Orion project from Gravity or the Compass project from Dagster where you just have this agent that's churning through your data asynchronously

and finding these little nuggets of insight to say, hey. Did you know this? Or, hey. I just found out this interesting fact. And then being able to take that and turn it into, okay. Now that I know this, what is the next step? And actually having some recommendation of here are the five things that you should try and then being able to actually have the capacity to do more of that experimentation of whether it's AB testing on a website or changing some of the features of a given product to

and I'm just wondering what are some of the ways that you're seeing folks leverage this unlocked capability and the fact that we're not spending so much time on toil and can instead focus on these higher leverage elements.

Gleb Mezhanskiy

Yeah. I think the high leverage part is really important here, Tobias, because there are also a lot of, I would say, kind of shallow use cases, where just having an agent comb through your data and come up with things that look interesting doesn't necessarily mean that it is impactful for the business. I think ultimately, the value of the data comes from the

value of decisions that we're making based on that data. And that value can also be quantified through risk. Like, if the business is thinking about whether to expand or kill certain products or to invest more money in a particular acquisition channel for its customers, then if the data helps you reduce that risk, that is very quantifiable value.

That's if we're being a little bit more academic, from information theory. But I think that it all has to come down to what is the business problem we're trying to solve and what's the best way to solve it. And I think that if we identify these bottlenecks that are currently very manual or decisions

made not based on data and better data can help us reduce the risk of those decisions and ultimately get the business more efficient and grow faster, then I think the automation possibilities are completely limitless. Because like you said, we can go from the world where a human will look at a dashboard, make a suggestion to another human who would then make a product decision, who would then write the code and change something about your product or propagate the decision through organization.

Now you can have an agent that evaluates the data and makes the decision immediately. Those decisions can be quite diverse. That kind of automation is not new. For example, ride sharing businesses like Uber and Lyft pioneered automatic decision making and balancing two-sided markets. And they have been incredibly, incredibly data driven even without AI. But the types of decisions that could be automated were limited to just high frequency

use cases like driver-passenger matching and algorithmic pricing. But now we can automate way more people-heavy and diverse processes, ranging from support to allocating your marketing resources and, in general, optimization of the entire business. But again, I think that it all has to come from the business use case rather than from kind of, oh, let's let the agent

figure out things on its own. I do think that occasionally we can stumble on a treasure chest in the data if we let agents loose, but I would be surprised if that is a more systematically winning paradigm than starting from, okay, what does the business actually need? I think the other important aspect here is that, obviously, agentic coding is important, but it's still a problem to supply these agents with the context

because a lot of the context exists outside of the database and outside of the immediate code base that the agent has access to. It exists in people's heads. It exists in email and Slack and Teams and documentation tools and Excel spreadsheets. And so I think we're still figuring out how to collect all this context and feed it to the agent so that the agent can actually work with the data efficiently. And it is a real bottleneck that I think we will see being solved over the next year.

Tobias Macey

And so we've discussed all of these wonderful exciting futures that we're looking forward to. And so as an individual contributor or as an engineering leader, what are the things that I need to be thinking about and concrete steps that I should be taking today to make sure that I'm able to actually realize some of this promise

and not just get bogged down with all kinds of bugs or errors or problems that are introduced because somebody else with AI is spewing all kinds of problematic code and data into my systems, or just ways that we should be thinking about this as professionals? What are the skills? What are the day to day practices that we need to be investing in to be able to build that flywheel of letting AI do more of the drudgery and move beyond just, can I do something locally on my machine, to what are the confidence building steps that I need to take to be able to actually let the AI run in that inner loop of my data system?

Gleb Mezhanskiy

Yeah. What a great question. Well, maybe we should start a little bit with the basics, the foundation of infrastructure. So in the startup world, everyone runs on really exciting tools, and there are lots of great tools in the modern data stack today that are just very easy. And they are also AI first and very friendly to agents, they have MCPs, and everyone can move fast and be happy. But the larger world, most of the data world, still runs on legacy data infrastructure.

So much so that if you talk to the leaders in the data space, like the leading data platforms, they still estimate that there is fifty to one hundred and fifty billion in data platform spend going into legacy tooling. So those billion dollar companies themselves, they think that there's like 10x to 30x more workloads that are currently locked in legacy platforms. And the risk is that if you are a data team operating on a legacy platform,

you can't really take full advantage of AI because those platforms are not AI native. It's really hard to get data out of them. It's really hard to ensure interoperability with the modern tools. Your data stack is fragmented and that just slows you down. So I think that's just a very important foundation. Make sure that you're running on the modern data platform because otherwise you're going to be fundamentally slowed down. Now, the good news is that with AI, migrations

to modern platforms are far, far easier, and Datafold has kind of pioneered the software first approach to data platform migrations, but we are obviously not the only ones doing this. And it's obvious that with agentic coding, moving code, which is the primary cost of migration, has become much, much easier. And with the cost of data platform migrations and the timelines

really plummeting, I think there is just no good reason to be stuck on the legacy infrastructure because I think it will present a very substantial long term risk for your business if you do so. The side effect of this is I think that legacy data platforms are completely cooked like ETL tools and on prem installations because

the only reason why they're still in business and still have all these enterprise customers running that legacy software is because of the migration friction. And if that's going away, then I don't think they have any chance to stand against the modern players. So that's one. So first, make sure that your foundations are solid as a data leader. I think the second thing is, like I said, even at a startup, ensuring that your team takes full advantage of AI is a challenge. It's hard. It's hard because not only are you fighting some inertia and

people having their own ways about the workflows, but everything is changing so fast. Today you think, oh, this coding agent is the state of the art. Tomorrow, a new model comes out and everyone says, this new thing is state of the art. And so there's a lot of noise, there's a lot of confusion. And then like you said, there are obviously risks in terms of you don't want to blow things up, and there are real security and privacy risks. And so I don't think that there is any perfect answer

other than really embracing this new world and investing and learning about it and trying things out. I don't think there's a perfect recipe of do this or use this agent or use this model. I think everyone needs to try for themselves and learn what works, what doesn't, and also invest in ways that allow them to do AI data engineering, AI data product building safely and securely.

And basically, you need to bring agency to your team, no pun intended. There is no one that will magically provide you a solution, I think, this year at least that will just work and solve all of your problems. You'll have to figure it out. But that process of figuring it out is important. I've seen that teams that invested in it that encouraged their teams to try things, figured out a way to securely deploy

AI and let people try it, enable agentic coding. These teams are moving way faster, and the gap between teams that are proactive versus just waiting and not investing in education and trying things and piloting new ways is growing rapidly. So I think that's really, really important. And then if you're a leader, we've talked about what are the career

implications for, let's say, individual contributors. But for data leaders, I think it's also quite important to recognize that if you're not riding the AI wave right now, and I don't want to sound buzzwordy, if you're not embracing AI with your team, your job is at risk, because at some point the leadership of the organization will recognize that you're slowing everyone down. That's kind of a negative way to say it. But the positive way of saying it is, as a leader, you can multiply your impact by 10 to 50x if you invest in making your team fully AI enabled. So I think that's just a very kind of black and white world that we're embracing, just because of how disruptive this technology is.

Tobias Macey

And I think too, some of the concrete steps and ways that we should be thinking about how to keep that flywheel moving and get it moving faster is, as you're doing your day to day work, if you are interacting with an AI, whether it's Claude Code or GitHub Copilot or what have you, anytime you find yourself repeating something,

that is a signal that maybe you should start to codify that, either into an AGENTS.md or a CLAUDE.md that lives in the repository, or codifying it in a skills.md so that the agent can incorporate that context for that particular style of workflow, and just documenting those

workflows and the ways that you work and the types of work that need to be done so that you're not repeating yourself every time and so that everybody on the team is able to take advantage of those context cues collectively rather than it being a single player mode where you as an individual contributor have figured out all of the tricks, and so you're able to move at 50 times your regular pace, but everybody else on your team is left behind. And just even just getting the agent to generate

those skills files or agents files to say, hey. Make sure that this gets added to the team context or wherever that might live and just thinking through what are some of the ways that you can accelerate that workflow and reduce the need to do that constant rediscovery of best practices and then even letting the agent loose on your code base to say, hey. Tell me what are some of the patterns that we have established, codify that so that it's easier to understand,

and then you can determine whether that's something that you want to invest in going forward or not. But you can use these tools for more than just writing the code. You can use them for understanding it, doing some analysis of maybe what are some of the areas of duplicative effort that we have across these different tool chains. And then the other key piece is, use those agentic coding tools to help you write more utilities to give you better and faster feedback. So for instance, I'm using a combination of dbt and Superset for business intelligence.

And one of the challenges that we've been going through recently is building a better workflow for going from QA to production with Superset instead of just being point and click. So there are a bunch of YAML files that are fairly opaque and inscrutable to a human, and it's hard to tell whether everything is lined up. So I got Copilot to write a utility, to say, hey. Here are all the YAML files.

Write a validation script that will look at the dashboards, make sure that all the charts they reference exist, that for all the charts that are referenced, all the columns that they're looking for actually exist in the datasets, and that those datasets are actually appropriately pinned to dbt models, so that I can have an end to end confidence building

exercise before I ever ship it and just being able to build some of those tools where the agent could actually execute that as it's making changes to say, hey, am I going in the right direction or did I just break everything and I need to back up?

Gleb Mezhanskiy

Yeah, well, I think, Tobias, you are here illustrating a very important point, which is that there is effort and there is skill and there is craft in mastering the AI first workflow. And I think this is one of the arguments that I hear thrown around a lot: folks who are maybe more resistant to using AI

are labeling people who are very AI first as lazy because, like, oh, sure, you'll just tell AI to do stuff and then there's no craft. I actually think it's the opposite. Coming back to the Harry Potter analogy, yes, you can become a wizard, but you have to go to Hogwarts first. It's not something where you can just wave your magic wand and then all of a sudden great things are happening.

Even though the bar for making great things is definitely way lower thanks to AI, I do think that there is a skill and you have to invest time and energy and learn how to leverage AI most effectively. And back to our question about what differentiates the most successful data practitioners, who are gonna be leveraging AI and will be very relevant in this new economy, versus those that could struggle, I think that AI mastery is a craft and skill that

actually is quite defensible and important in the market. For example, I can't imagine any effective data team at a fast moving organization today who wouldn't evaluate new hires in terms of their AI skills. I personally would ask questions, how have you been using AI in your workflow? What have you built to improve your workflow? What have you done to enable your coworkers to improve the workflow? Just to your point, right? Because even though it is magic, it doesn't come necessarily

easy. You have to also invest in understanding how it works, invest in education, invest in building tools and scripts for yourself. And I do think that AI improving will

probably make some things easier. I think some of the patterns that existed a year ago right now are not necessarily relevant because AI can figure things out more on its own. But I still think there's always gonna be something to learn and to master that can differentiate you from everyone else who hasn't invested in learning and mastering.
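
The validation utility Tobias describes earlier in this exchange can be sketched roughly as follows. This is a minimal illustration, not his actual script: the dictionary shapes, the `dbt_model` key, and the `validate_assets` name are all hypothetical stand-ins for whatever the parsed YAML files really contain, and they are not Superset's export schema.

```python
from typing import Dict, List, Set


def validate_assets(
    dashboards: Dict[str, List[str]],  # dashboard name -> chart names it references
    charts: Dict[str, dict],           # chart name -> {"dataset": str, "columns": [str]}
    datasets: Dict[str, dict],         # dataset name -> {"columns": [str], "dbt_model": str or None}
    dbt_models: Set[str],              # names of dbt models known to the project
) -> List[str]:
    """Return human-readable errors; an empty list means everything lines up.

    All input shapes are hypothetical simplifications of parsed YAML.
    """
    errors: List[str] = []

    # 1. Every chart a dashboard references must exist.
    for dashboard, chart_names in dashboards.items():
        for chart_name in chart_names:
            if chart_name not in charts:
                errors.append(
                    f"dashboard '{dashboard}' references missing chart '{chart_name}'"
                )

    # 2. Every column a chart uses must exist in its dataset.
    for chart_name, chart in charts.items():
        dataset = datasets.get(chart["dataset"])
        if dataset is None:
            errors.append(
                f"chart '{chart_name}' references missing dataset '{chart['dataset']}'"
            )
            continue
        for column in chart["columns"]:
            if column not in dataset["columns"]:
                errors.append(
                    f"chart '{chart_name}' uses column '{column}' "
                    f"missing from dataset '{chart['dataset']}'"
                )

    # 3. Every dataset must be pinned to a known dbt model.
    for dataset_name, dataset in datasets.items():
        if dataset.get("dbt_model") not in dbt_models:
            errors.append(f"dataset '{dataset_name}' is not pinned to a known dbt model")

    return errors
```

Because the function returns a plain list of errors, an agent (or a CI job) could run it after each change and treat a non-empty result as the signal to back up, which is the fast feedback loop described in the conversation.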

Tobias Macey

Absolutely true. And one of the other, at least perceived, barriers to entry that can happen, particularly if you are in a company that doesn't have free access to unlimited compute or an expansive budget, is that there is a nonzero amount of cost involved in using most of these leading edge AI tools. So with Claude Code Max, you're talking about $200 per user per month, which is in the grand scheme of things very affordable.

But if you have a large team, it can be quite substantial. And I'm just wondering what are some of the ways that you're seeing teams address some of that consideration of how do I justify this initial expense before I'm actually getting all of the benefits, and just some of that balancing act of the catch-22 that you're in, where I want to move faster but I can't afford to, but I can't afford not to because I have to move faster.

Gleb Mezhanskiy

Yeah. I think we're starting to see the cost of LLM inference be quite substantial. So for example, a year ago when our customers asked us, well, how worried should they be about LLM inference for Datafold features? I would say, well, if you're in a multi-tenant environment, don't worry about it because we pay for it. If it's a single-tenant environment, we use LLM endpoints, and that's negligible.

Now it's a very different story because of how much we were able to automate and how much we actually are consuming in terms of LLM costs, so much so that we had to establish kind of LLM FinOps at Datafold because of how significant those costs have become. And I think pretty recently they surpassed the costs of overall infrastructure. All of the infrastructure combined versus the LLM costs, the LLM costs have become more substantial, which I think tells you just how much work is

actually being done. The other thing I'll say is that it shouldn't really stop you from automating your work. It certainly doesn't stop us. And I think that if you're running a team of engineers or data engineers and all team members are on, let's say, the Claude Code Max plan at $200 a month and they're reaching limits on that plan, well, congratulations,

you're running an extremely efficient and impactful team. That's the world we want to be in. I would be far more worried about a team that uses a couple of $15 a month subscriptions for tools that don't help much, just because, again, of how much leverage it gives us. I think relative to the cost of engineers, relative to the cost of our attention and our time as humans,

this is still low relative to how much we can actually do and how much leverage we're getting. And then again, that's not to say that there could not be completely wasteful AI spend. So if you task an agent with some poorly defined task without guardrails and using the most expensive model, because why not, then well, I wouldn't be surprised if you rack up a very large bill. That can happen, it has happened to us, and it will continue to happen. But I don't

see it as being prohibitively expensive for the industry. Furthermore, the advancements in the models and in the efficiency are also quite rapid. So we're seeing open source models like Qwen that are maybe lagging behind the frontier, most intelligent models like Opus, being quite on par in terms of baseline coding. And those models you can run on, let's say, a 32 gigabyte Mac these days, not even the top of the line Mac. And then larger models can fit into more upgraded

machines. And I think these are just one-off data points, but they are signals that, at least from the individual productivity standpoint, LLM costs are far from being prohibitive for modern organizations. And I think there are way, way more things that we're spending on that are providing far less value than that.
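
The cost argument above comes down to simple arithmetic, sketched here as a rough back-of-the-envelope helper. The $200 per user per month figure is the one mentioned in the conversation; the hourly engineer cost and the team size are hypothetical placeholders, not numbers from the episode, and the function names are made up for illustration.

```python
def break_even_hours(seat_cost_per_month: float, engineer_cost_per_hour: float) -> float:
    """Hours of engineer time a seat must save per month to pay for itself."""
    if engineer_cost_per_hour <= 0:
        raise ValueError("engineer_cost_per_hour must be positive")
    return seat_cost_per_month / engineer_cost_per_hour


def team_seat_cost(seats: int, seat_cost_per_month: float) -> float:
    """Total monthly subscription spend for a team."""
    return seats * seat_cost_per_month


if __name__ == "__main__":
    seat_cost = 200.0    # per user per month, as quoted in the conversation
    hourly_cost = 100.0  # hypothetical fully loaded engineer cost per hour
    team_size = 25       # hypothetical team size

    print(f"Break-even: {break_even_hours(seat_cost, hourly_cost):.1f} hours saved per month")
    print(f"Team spend: ${team_seat_cost(team_size, seat_cost):,.0f} per month")
```

Plugging in your own hourly cost shows how few saved hours are needed per seat, which is the comparison against "the cost of engineers" being made here. It says nothing about usage-based inference bills, which can grow well beyond a flat subscription.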

Tobias Macey

Absolutely. So as you have been going through this journey yourself where you have moved from being a data engineering company as these AI models became more capable, you reoriented your product vision to be AI native and brought the LLMs into the inner loop of what you're offering, and now you're also using it

extensively for doing the coding and data engineering within your own company. What are some of the key takeaways that you want to make sure that folks are listening to and aware of and maybe some of the other companies or individuals that you're looking to for inspiration who are maybe a couple of steps ahead of where you are?

Gleb Mezhanskiy

Yeah. Well, I can speak for the transformation at Datafold. So in terms of how we use AI internally to ship our own product, over the course of the past couple of months, we shifted from most of the code being

written by software engineers with some AI automation to none of the engineers actually writing code anymore. They're writing prompts and having conversations with the AI agents, different kinds of agents for writing code and then QA-ing the code and then operationalizing that and managing infrastructure.

That has been a very, very important shift. And then we have completely autonomous agents that bring up the whole software stack in the cloud, build a new feature, bring up a preview with a URL that anyone on the team can look at, and test everything. And that's pretty much the entire software engineering loop being automated end to end, from, let's say, a Linear task all the way to production.

And I think that that's the future for sure. I think there's more that we can automate, but it has been profoundly impactful on how much we were able to ship as a company. Externally in terms of what we were able to do for customers because of that is, I think one of the most interesting

changes for us as a company has been that pre AI we were selling tools from a SaaS model to data teams. And they would use those tools to be more productive. Let's say more productive at validating their data, more productive at discovering their data, more productive in terms of communicating with the stakeholders about their data. And AI enabled us not just to make those tools better, but to also start offering solutions for customers that instead of offering a productivity gain,

offer a complete business outcome. And an example of that is a data platform migration. So historically,

to execute a data platform migration, you would hire a team of consultants, and they might bring some internal tools that would help them translate some of the code, and then the rest they would do by hand, or they would use some of the native tools from the data platforms. But all in all, it was an extremely manual effort, and the market was kind of separated into software companies building so-called accelerators

that would then sell that software to service companies who would then use accelerators to perform services. And AI enabled us to combine both in one offering where we have our own software, which essentially is a team of AI agents that have different roles that we deploy to provide migration as an outcome. So in that model,

there is no human billable hours that we need to sell, yet we're able to provide a full service where the customer gets migration done, completed as an outcome. So we're kind of replacing both the service plus the tool with full AI automation.

And there are other use cases that we are going to launch that are following the similar model where instead of buying tools that make your team more productive to accomplishing certain tasks, you can just buy an outcome of the task being done. And when I say task, I mean a very complete large business outcome. For example, to execute a large data platform migration historically has taken one to three years at the cost of millions of dollars at enterprise. Now it can be done in weeks

at a fraction of the price. And there are other problems like that that exist in the enterprise that can be solved in a similar model. So I do feel that

that's just one example of one company operating in the data domain. But I do think that this model, where you can replace services plus scattered tools with a single offering and sell an outcome that's way faster, way better, way more economically efficient with full end to end AI automation, has a lot of promise in this world, and I would expect to see more problems being solved like that.

Tobias Macey

And are there any other aspects of this shift in capabilities and workflow in outcomes and just the overall predictions that you have as we continue to accelerate into this uncertain future that we didn't discuss yet that you'd like to cover before we close out the show?

Gleb Mezhanskiy

Yeah. I would say one thing, and it may be worth a deeper dive, but I think it's still important for folks to kind of dwell on because it certainly was a big realization for me. I do think that the way we were thinking about data quality in a pre-AI world and a post-AI world is very different, because if you think about the last five years, there's been a huge

conversation in the data community about how do we ensure data quality? How do we make sure that the products we're delivering to the business are correct? There have been multiple frameworks started, open source ones and features in existing frameworks, to tackle data quality. Multiple companies got funded. Hundreds of millions of VC dollars went into the space. Datafold is

no exception to this. Data quality was the entire focus area for us for the first three years of the company. And I think all of this is now completely irrelevant, because the way we thought about data quality before was, okay, how do we instrument all these different rules and checks and tests and assertions about the data so that we can signal to data consumers that this data is good? But it's always been a moving target because data is changing, the business requirements are changing and

I don't think we ever got to the point where a data team would be like, okay, great. We've invested so much in data quality and we are perfectly happy. I think what happened actually is that we learned how to work with imperfect data over time. And I think that in the AI world, this is irrelevant because ultimately all the solutions we've been building were built with humans in mind, and humans have very limited context and very limited capacity for processing.

So how can we curate one dataset that can describe this thing, that we know is gold? It's good. Humans, please use it. Don't use anything else. Just this dataset. And make sure the test coverage is great. I think with AI,

the whole thing flips upside down. You want AI to have access to all of your data, even the data that's imperfect, because none of the data is perfect to start with. And then have AI actually figure out how to provide the better quality answer to your question, considering everything and considering the various degrees of reliability or accuracy or precision of your data points. Now, that is not to say that AI will do a marvelous job

if you just throw all the data that you have at it, right? I think this presents its own challenge. But I do think that the way we're thinking about data quality is different in the AI world. What matters in the AI world is that, one, AI has the ability to act. So we talked about the agentic loop: execute queries, evaluate queries, run tools like dbt. And two, AI has access to all of your data because, again, the more data points you have, the more complete a picture you can construct. But three, you also need to make sure AI has the right context on that data. So

what does each dataset mean? Where does it come from? How does it relate to the business entities and the nature of the business? What about those data sources and how they all interlink? And that third component is actually currently missing. And I think that this year is going to be a big year for figuring out how we feed the right context into the agents so they can actually figure out how to solve end to end problems with us, considering that all data is

imperfect. And it's also a very big investment area for us at Datafold, because we had to build what we call a data knowledge graph in order to power migrations. And we're going to be offering that to all customers to

help their agents provide more context. But in general, I think that's the really big shift: from let's write a bunch of tests for humans and curate data for humans and kind of focus everyone on very small pieces, to let's embrace the full complexity of data, but figure out how to make AI work reliably with it.

Tobias Macey

Yeah. Definitely a valuable thing to be thinking about: where we should be spending our time most effectively, and what are some of the pieces that the AI doesn't care about and that we should maybe just cede control of, to hark back to our earlier points. So for anybody who wants to get in touch with you and follow along with the work that you're doing and the rest of the team, I'll have you add your preferred contact information to the show notes. And, as the final question, I'm sure that your answer has changed a few times since we last spoke, but what do you see as being the biggest gap in the tooling or technologies that's available for data, and I'll add AI, management today?

Gleb Mezhanskiy

I would say that right now, I firmly believe that tooling capability exceeds our ability to adopt those tools. And I think that instead of trying to find the perfect tool, now is the time to experiment

with what's available and build your own workflow. Just like the example you gave, Tobias, of how you've been using coding agents to write scripts, to automate the workflow and kind of build skills to make those agents more effective at accomplishing more and more tasks. Now is the world where you can fill your own gaps, you can build your own tools. And I think that's a very important mind shift that still very, very few people in the data community fully embrace, and I would love for everyone to lean on this as hard as possible. Go and build your tools,

make your workflows better, increase your quality of life at your job. There has never been a time like this. And that's what I think is missing. Go build your own tools.

Tobias Macey

Absolutely. Well, thank you as always for taking the time today to join me and share your own thoughts and experiences of living out this weird and exciting future that we're all barreling into. So as always, I appreciate you taking the time, and I hope you enjoy the rest of your day.

Gleb Mezhanskiy

Thank you, Tobias. I will let you go back to your agentic coding.

Tobias Macey

Thank you for listening, and don't forget to check out our other shows. Podcast.__init__

covers the Python language, its community, and the innovative ways it is being used. And the AI Engineering Podcast is your guide to the fast moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about

it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.

Transcript source: Provided by creator in RSS feed: download file