Hello, and welcome to the Data Engineering podcast, the show about modern data management. This episode is supported by Code Comments, an original podcast from Red Hat. As someone who listens to the Data Engineering podcast, you know that the road from tool selection to production readiness is anything but smooth or straight.
In Code Comments, host Jamie Parker, Red Hatter and experienced engineer, shares the journey of technologists from across the industry and their hard-won lessons in implementing new technologies. I listened to the recent episode, Transforming Your Database, and appreciated the valuable advice on how to approach the selection and integration of new databases and applications, and the impact on team dynamics.
There are 3 seasons of great episodes, and new ones are landing everywhere you listen to podcasts. Search for Code Comments in your podcast player or go to dataengineeringpodcast.com/codecomments today to subscribe. My thanks to the team at Code Comments for their support. Data lakes are notoriously complex.
For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for. Starburst has complete support for all table formats, including Apache Iceberg, Hive, and Delta Lake. And Starburst is trusted by teams of all sizes, including Comcast and DoorDash.
Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst today and get $500 in credits to try Starburst Galaxy, the easiest and fastest way to get started using Trino. Your host is Tobias Macey. And today, I'm going to be talking about my experiences managing the QA and release process of my data platform. So for people who have been listening for a while, I have been sharing different stages of my journey of building out a new data platform for my day job.
And as a quick recap, and I'll add links in the show notes to some of the previous episodes where I've talked about this, but my architecture is focused on a lakehouse-oriented approach. So I'm using Airbyte for data loading, S3 for storage, Trino via the Starburst Galaxy platform for the query engine, dbt for transformations, and Dagster for overall orchestration.
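To make that stack a little more concrete, here is a rough sketch of how those pieces can be wired together in Dagster, assuming the dagster-airbyte and dagster-dbt integration packages. The hostnames, credentials, and dbt project paths are placeholders rather than my actual configuration, so treat it as an illustration of the shape of the project rather than the production setup.

```python
# Sketch of a Dagster Definitions object tying Airbyte syncs and dbt models together.
# Hostnames, credentials, and paths are placeholders.
from dagster import AssetExecutionContext, Definitions, EnvVar
from dagster_airbyte import AirbyteResource, load_assets_from_airbyte_instance
from dagster_dbt import DbtCliResource, dbt_assets

airbyte = AirbyteResource(
    host=EnvVar("AIRBYTE_HOST"),
    port="8000",
    username="airbyte",
    password=EnvVar("AIRBYTE_PASSWORD"),
)

# Each Airbyte connection becomes a Dagster asset, so syncs show up in the lineage graph.
airbyte_assets = load_assets_from_airbyte_instance(airbyte)

dbt = DbtCliResource(project_dir="transform")

@dbt_assets(manifest="transform/target/manifest.json")
def dbt_models(context: AssetExecutionContext, dbt: DbtCliResource):
    # Run the dbt project and let Dagster translate the CLI events into one asset per model.
    yield from dbt.cli(["build"], context=context).stream()

defs = Definitions(
    assets=[airbyte_assets, dbt_models],
    resources={"dbt": dbt},
)
```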
I have also recently been integrating Superset for the business intelligence and data exploration layer, for being able to expose the different datasets that my team is building to downstream data consumers. Anybody who has been working on data for long enough knows that data is one of the harder problems because of the fact that it is inherently stateful. It's fairly straightforward to be able to test and validate and release applications that are largely stateless or where the state isn't
the core element of the business logic. So web applications do require state, but you can usually generate some dummy data to at least get a pretty good sense of whether or not the application functions as designed. With data pipelines and data engineering systems, it's not as straightforward as that. You can test the business logic of transformations
in isolation to ensure that the shape of the data is being manipulated appropriately, but that only gets you so far. At some point, you have to be able to work with production or production-like data to be sure that the changes that you are trying to incorporate are having the desired impact. The real challenge there is that you can't realistically copy your entire production system to a preproduction environment, for reasons of cost and time. Oftentimes it's not even physically possible to maintain copies of that data, and there are compliance issues on top of that, etcetera, etcetera. So there are many companies out there that focus on trying to generate data for you. I haven't dug into that personally yet. That might be something that we look at down the road, but QA in general for data is really hard. There are lots of different tools out there to address different pieces of that.
So there are things like lakeFS for being able to do data versioning. The Nessie project is similar for Iceberg tables. Snowflake has their copy-on-write tables, and there are some other warehouses that have similar capabilities. Iceberg has the ability to do some time traveling, but all in all, there are still a lot of moving pieces, and it's not always easy to be able to test changes in isolation effectively before they get into production.
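As a small example of the Iceberg time traveling I mentioned, here's a sketch of querying an earlier snapshot of a table through Trino from Python. The catalog, schema, table name, and snapshot ID are hypothetical, the syntax assumes a reasonably recent Trino release with the Iceberg connector, and authentication is omitted for brevity.

```python
# Sketch of Iceberg time travel through Trino using the `trino` Python client.
# Host, catalog, schema, table, and snapshot ID are placeholders; auth is omitted.
import trino

conn = trino.dbapi.connect(
    host="trino.example.internal",
    port=8080,
    user="qa-validator",
    catalog="lakehouse",
    schema="staging",
)
cur = conn.cursor()

# Inspect the snapshots Iceberg has retained for the table.
cur.execute('SELECT snapshot_id, committed_at FROM "orders$snapshots" ORDER BY committed_at DESC')
for snapshot_id, committed_at in cur.fetchall():
    print(snapshot_id, committed_at)

# Query the table as of an earlier snapshot to compare against the current state.
cur.execute("SELECT count(*) FROM orders FOR VERSION AS OF 1234567890123456789")
print(cur.fetchone())
```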
In my own data platform, we have pre-production deployments of the different pieces of the stack, but it's not always clear how or when or why to validate different changes because of the fact that the semantics in QA and production aren't always the same. That has led to some issues recently with the overall
uptime and reliability of our production environment, which is something that I'm currently very focused on trying to address. And I wanted to take this opportunity to share some of my thoughts on why it is that the release management process is hard and some of the ways that I'm thinking about trying to break it down into the component pieces and figure out how to address those points where there's overlap between those different systems.
So taking each piece in turn: Airbyte, as I mentioned, we use that for extract and load. That just by itself has a lot of challenging aspects to it because of the fact that
not every data source has a non-production analog to it. So for the case of application databases for applications that we own, we do have pre-production versions of those databases, and we want to be constantly validating against those QA environments as database changes from the application land, to ensure that nothing breaks when those changes get to production.
But there are also a number of data sources that don't have any QA versions. So for Salesforce data, there might be a sandbox environment for the company, but it's not likely that it's going to be very useful or even representative of what is actually in production. Similarly, for things like HubSpot, Mailchimp, and a lot of the SaaS services, they're not going to have a QA dataset that you can use.
You can pull all of that same data into QA, but then again, you start running into issues of compliance and cost management, where you don't necessarily want to pull all of that data into QA and have two copies of it just for testing. So, again, it gets really challenging. In isolation, we can onboard those different datasets and make sure that they can run in QA, but maybe not sync them all the time.
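One way I'm thinking about handling that is to keep a curated list of connections in the QA Airbyte instance and only trigger them on demand when we actually need to validate something, rather than syncing everything on a schedule. As a sketch, assuming a self-hosted Airbyte instance and its connection sync endpoint, with placeholder connection IDs:

```python
# Sketch of syncing only a curated subset of connections in the QA Airbyte instance.
# The URL and connection IDs are placeholders; the endpoint reflects the self-hosted
# Airbyte config API, so verify it against the version you are running.
import requests

AIRBYTE_QA_URL = "http://airbyte-qa.internal:8000"
QA_CONNECTIONS = [
    "11111111-1111-1111-1111-111111111111",  # app database that has a real QA analog
    "22222222-2222-2222-2222-222222222222",  # singleton SaaS source we still want to validate
]

for connection_id in QA_CONNECTIONS:
    resp = requests.post(
        f"{AIRBYTE_QA_URL}/api/v1/connections/sync",
        json={"connectionId": connection_id},
        timeout=30,
    )
    resp.raise_for_status()
    print(connection_id, resp.json().get("job", {}).get("id"))
```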
But where it gets challenging is where you have the overlap with the dbt code that needs to transform, in particular, those QA databases, making sure that the schema changes from the application developers don't break the dbt code. Which then brings us into change management for dbt, where in order to be able to ensure that the transformations are having the intended output, that they are semantically reasonable, that they look and feel correct,
it's not very useful to run that against QA data, because QA data is often messy and inconsistent. There's no real concerted effort to make sure that that data is clean and representative of the way things are being used in production. And so most of our dbt changes are actually tested against the raw stage of the production data as it gets landed in the S3 buckets by Airbyte, but namespaced to a separate schema from what we expose for downstream data consumers.
So that allows us to have a target with realistic data to test against, but it also means that those changes aren't being run against the QA database data loaded by Airbyte.
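To give a sense of what that looks like in practice, here is a sketch of building only the staging models against that production raw data while writing into a developer-specific schema. The "prod_raw" target name and the DBT_DEV_SCHEMA variable are assumptions about how a dbt profile could be set up for this, not something dbt gives you out of the box.

```python
# Sketch of building only the staging models against production raw data while
# writing into a developer-specific, namespaced schema. The target name and the
# DBT_DEV_SCHEMA variable are assumptions about the dbt profile, not dbt defaults.
import os
from dagster_dbt import DbtCliResource

# The profile is assumed to interpolate this into the schema that models are written to.
os.environ["DBT_DEV_SCHEMA"] = "qa_tobias"

dbt = DbtCliResource(project_dir="transform", target="prod_raw")

# Build just the staging directory so upstream schema changes surface early.
dbt.cli(["build", "--select", "staging"]).wait()
```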
So one solution that I'll be taking a look at is figuring out how to maintain the set of data connections in Airbyte in QA: one, to make sure that as Airbyte releases new versions, it continues to function before we push those new versions of Airbyte into production, but also to be able to validate those database changes in QA before they make it into production. And so that will mean maintaining that set of connections and ensuring that the dbt models that transform that specific subset of data sources are also executed in QA. So this is gonna require a little bit of engineering work in our pipeline design and in our QA and release management process. And I'll dig a little bit more into that as I get further along in the stack. Fortunately, the query engine layer is managed by the folks at Starburst, so I don't have to get too deep in the weeds on managing the release process of the actual query layer. So that lightens my load a bit, but that brings me now
to Dagster, which is the orchestration engine. That's where we define the pipelines. That's where we, in particular, ensure that Airbyte connections are getting synchronized on a regular basis, currently nightly, and that the downstream dbt models are executed.
There are also other pipelines that we develop, and so we need to have a means of deploying those changes into a QA environment, verifying that the pipeline loads, that it's able to connect with the other systems that are necessary for executing those pipelines and orchestrating those systems, and making sure that Dagster itself will actually come up and run before we push that into production.
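For the nightly cadence itself, this is roughly what the job and schedule look like in Dagster, with the asset selection kept deliberately broad; the job name and cron expression are placeholders.

```python
# Sketch of a nightly asset job plus schedule; names and the cron expression are placeholders.
from dagster import AssetSelection, ScheduleDefinition, define_asset_job

nightly_job = define_asset_job(
    name="nightly_refresh",
    # Materialize everything: Airbyte syncs run first and dbt models follow their dependencies.
    selection=AssetSelection.all(),
)

nightly_schedule = ScheduleDefinition(
    job=nightly_job,
    cron_schedule="0 4 * * *",  # once per night
)
# Both of these then get passed to Definitions(jobs=[...], schedules=[...]).
```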
One complicating factor there is that our current method of deployment is that we're running a self-hosted Dagster instance, and we bake our dbt code into the container image that also holds our Dagster code. And so the dbt code and the Dagster pipeline code are tightly coupled in the deploy process, so a bug in one can easily hold up the release of the other.
Also, that means that people who are working on the dbt code need to coordinate with the people who are working on the Dagster code so that they know, "I've got this change that I haven't validated yet; don't push anything into production until this gets fixed." And so that can slow down the velocity of one team or both teams.
So that's another piece that I'll be investing in: figuring out how to break that tight coupling and be able to deploy dbt changes into production without necessarily carrying along the same set of changes in the Dagster pipelines. The other challenging piece of the Dagster pipelines is that, because there are numerous other systems that they have to integrate with and orchestrate across, validation in local development environments is challenging.
Fortunately, there are really great interfaces in the Dagster framework for being able to define resources that map to the interface that you are expecting.
So you can, for instance, swap in a different implementation of the S3 resource that maybe talks to the LocalStack framework, which is a way to mock an AWS environment, or use that same S3 programming interface but map it to local disk semantics, just so that you don't have to worry about all of the file movement pieces in your local environment being able to talk to S3.
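As a sketch of that resource-swapping pattern, the pipeline code can depend on a small storage interface, with an S3-backed implementation in production and a local-disk stand-in for development. The class names, bucket, and paths here are illustrative, not the actual resources in my code base.

```python
# Sketch of swapping an S3-backed resource for a local-disk implementation in development.
# Class names, bucket, and paths are illustrative.
from pathlib import Path

import boto3
from dagster import ConfigurableResource


class ObjectStore(ConfigurableResource):
    """The interface the pipeline code programs against."""

    def put(self, key: str, body: bytes) -> None:
        raise NotImplementedError


class S3ObjectStore(ObjectStore):
    bucket: str

    def put(self, key: str, body: bytes) -> None:
        boto3.client("s3").put_object(Bucket=self.bucket, Key=key, Body=body)


class LocalObjectStore(ObjectStore):
    root: str

    def put(self, key: str, body: bytes) -> None:
        # Map the same interface onto local disk so no AWS access is needed locally.
        path = Path(self.root) / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(body)


# Production wiring uses S3; local development swaps in the disk-backed version.
prod_resources = {"object_store": S3ObjectStore(bucket="lakehouse-raw")}
dev_resources = {"object_store": LocalObjectStore(root="/tmp/lakehouse")}
```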
There are cases, though, where that gets challenging, for instance, being able to integrate with Airbyte, where I don't necessarily want to have to run an entire Airbyte stack on my local machine to be able to verify that my Dagster pipeline that orchestrates Airbyte functions properly.
And so one of the other pieces I'll be working on there is fixing up the accessibility of our QA Airbyte environment so that it is reachable from developer machines, whereas right now it's not accessible based on network rules.
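The same idea applies to the Airbyte dependency: locally, a stub that satisfies the same interface but only logs what it would have done is enough to exercise the pipeline logic. The trigger_sync method name here is an assumed convention for my own resource layer, not part of any Airbyte API.

```python
# Sketch of a stub Airbyte resource for local development; trigger_sync is an
# assumed convention for my own resource interface, not an Airbyte API call.
from dagster import ConfigurableResource


class AirbyteClient(ConfigurableResource):
    host: str = "airbyte-qa.internal"

    def trigger_sync(self, connection_id: str) -> None:
        # The real implementation would call the Airbyte API here.
        ...


class StubAirbyteClient(AirbyteClient):
    def trigger_sync(self, connection_id: str) -> None:
        # Locally, just record the intent instead of hitting a running Airbyte instance.
        print(f"[local] would trigger Airbyte sync for connection {connection_id}")
```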
The other piece is investing further in our code base and in the set of resources and interfaces that we're using, so that we do have more of those locally runnable resource definitions that semantically map to their production equivalents. That way we can focus more on the pipeline logic, orchestration, scheduling, etcetera, and validate changes locally without having to do a whole bunch of work to spider out to different systems that are running in QA or production.
Superset is perhaps the easiest from a deployment perspective to do some validation on in QA as we publish new changes, where most of those changes are in the configuration of the running Superset instance and the set of integrations, for example being able to change the login mechanism. In our case, we use Keycloak for single sign-on, and so we want to ensure that that's working before we get to production. So that's been fairly stable.
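For reference, that login configuration lives in superset_config.py, and the Keycloak integration goes through Flask-AppBuilder's generic OAuth support. Here is a sketch of the shape of it; the realm, client ID, secret, and URLs are placeholders, and the exact provider settings will depend on how Keycloak is set up.

```python
# Sketch of superset_config.py settings for single sign-on through Keycloak via
# Flask-AppBuilder's OAuth support. Realm, client ID, secret, and URLs are placeholders.
from flask_appbuilder.security.manager import AUTH_OAUTH

AUTH_TYPE = AUTH_OAUTH
AUTH_USER_REGISTRATION = True          # auto-create users on first login
AUTH_USER_REGISTRATION_ROLE = "Gamma"  # default role for newly registered users

OAUTH_PROVIDERS = [
    {
        "name": "keycloak",
        "icon": "fa-key",
        "token_key": "access_token",
        "remote_app": {
            "client_id": "superset",
            "client_secret": "change-me",
            "client_kwargs": {"scope": "openid email profile"},
            "server_metadata_url": (
                "https://keycloak.example.com/realms/data/.well-known/openid-configuration"
            ),
            "api_base_url": "https://keycloak.example.com/realms/data/protocol/openid-connect",
        },
    }
]
```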
Where it gets challenging is that another element of validation is making sure that, as we release new versions of Superset, the charts and dashboards that we are relying on continue to be functional.
The problem there is that there isn't, to my knowledge, a good DSL for managing those chart and dashboard definitions as code. Instead, it relies on manually defining those charts and dashboards in one of the environments, generating an export of the YAML definitions from the running Superset instance, and then reimporting them into the other environments.
So for right now, that's likely something that we'll be investing in: managing the replication of those definitions from production down into lower environments to have that as a validation method, particularly as we continue to explore the permissions management in Superset, being able to define custom roles, define what datasets they have access to, etcetera, etcetera.
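The mechanics of that replication would likely lean on Superset's export and import commands, something along these lines; the flags shown are to the best of my knowledge, so verify them against the Superset version you're running before relying on them.

```python
# Sketch of copying dashboard definitions from production into a lower environment
# using Superset's CLI export/import commands; verify the flags for your version.
import subprocess

EXPORT_PATH = "/tmp/superset_dashboards.zip"

# Run against the production instance: export dashboards (and their charts) as a bundle.
subprocess.run(["superset", "export-dashboards", "-f", EXPORT_PATH], check=True)

# Run against the QA instance: import the bundle so dashboards can be smoke-tested
# against a new Superset version before production is upgraded.
subprocess.run(["superset", "import-dashboards", "-p", EXPORT_PATH], check=True)
```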
So that is, in large part, the set of challenges I'm currently running up against, and there's a lot of engineering work to do to make everything run smoothly.
The initial low-hanging fruit will be managing the Airbyte connections list in QA so that it covers the sets of data that have a QA analog, plus any singleton datasets that we also need to validate in QA before we publish changes to production, and ensuring that those connections are properly configured, reliable, and executed regularly, and then invoking the downstream set of dbt models for those datasets, at least for the staging layer, to make sure that we
catch any breakages before that change lands in production. The other major area of effort is definitely going to be around enriching the set of resource definitions in Dagster and the access to running systems from local machines so that it is easier to be able to get up and running, verify changes, particularly for people who are doing pull request reviews of those changes rather than just crossing your fingers, deploying to QA, and hoping everything works.
As we get further along, other pieces that we'll be investing in are likely some of the data versioning capabilities of things like Iceberg and either Nessie or lakeFS, maybe some combination thereof, as well as overall training of the rest of the data team to help them understand the steps necessary to validate the changes that land in QA before they get pushed to production, so that we can all have high confidence that everything that is running in production
has been checked, has been verified, and is as reliable as we can make it given the vagaries of data and the interesting ways in which it breaks. And so we'll also be investing more in observability and even more data testing than we already do. So given all of that, I hope that this has been a useful reflection for people who are listening, and I'm also interested in hearing other people's experiences of how they approach validation, release management, and QA environments
for their data systems. This is a recurring theme that has come up throughout various episodes as I talk to people. And so I think that continuing to push on this as an ecosystem is important. I think that we have definitely made a lot of strides from where we used to be a few years ago, but I also think that there still is not any cohesive sense of how to do this from an end-to-end perspective. I think different systems have their own little pieces of it, and also some ecosystems
have a solution that maybe spans end to end, but it doesn't necessarily bridge outside of those ecosystems. And so if people have thoughts on how they approach QA, or if they have thoughts on people who I can bring on to the show to dig deeper into this space or maybe pieces of this space, I definitely appreciate those recommendations. So feel free to send emails or fill out the contact form on the website, dataengineeringpodcast.com.
And so I appreciate all of the time everybody has taken to listen to the show over the years. I hope it's been valuable to you. I hope you continue to listen and I appreciate that and hope you all enjoy the rest of your day.
Thank you for listening. Don't forget to check out our other shows, Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes.
And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.