Hello, and welcome to the Data Engineering podcast, the show about modern data management.
You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up to date. With Materialize, you can. It's the only true SQL streaming database built from the ground up to meet the needs of modern data products.
Whether it's real-time dashboarding and analytics, personalization and segmentation, or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results, all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free.
Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack.
Your host is Tobias Macey, and today I'm going to be sharing an update on my own experience of the journey of building a data platform from scratch. In particular, I'm going to be focusing on the challenges of integrating the disparate tools that are required to address the different concerns of that platform.
So if this is your first time listening to the show, I've been running this show for the better part of 6 years now, maybe close to 7. I have been working in technology for over a decade, and I have spent the past year and a half, coming up on maybe 2 years, building a data platform from scratch using, to get all the buzzwords out of the way, a cloud-first data lakehouse architecture: dbt for transformation, Airbyte for extract and load, Dagster for the overall orchestration, and Trino as the query engine for data on top of S3. Recognizing that that puts me in the minority of people who are building a data platform, particularly those with small teams, it has put me in the position of needing to figure out some of those interfaces for integration on my own.
For a lot of people who are just getting started with building out a data platform, they're probably going to go the managed platform route or pick from a selection of different vendors. The canonical one for the so-called modern data stack is probably Fivetran, Snowflake, and dbt, with maybe Looker as your business intelligence layer.
I'm not gonna spend a lot of time in this episode digging into the motivations for why I selected those different components of the stack, because I've covered that in other episodes, which I will link to in the show notes. But it has definitely led to a certain amount of friction as I try to manage some of the different integrations out of the box, although the story for that particular set of technologies has been steadily getting better as time goes by.
Today, I really wanna talk about where I am in my journey currently: I have a core set of functioning capabilities. I can ingest data, I can transform it, I can query it. But now I'm getting to the point of needing to onboard more people, provide a more seamless user experience, and manage some of the different means of data sharing or data delivery, where maybe not everybody who's going to be accessing the data lives within the bounds of my team or my department.
There are definitely data sharing capabilities that are part of some of the different platforms, the most notable probably being BigQuery and Snowflake with the ways that they manage data sharing. But there are a number of different ways of approaching that given that I am in the world of building a lakehouse architecture: I've got my data in S3, I'm using the Iceberg table format, so all of the data is already representable as a table.
So then the question is, for somebody who just wants to be able to access the data, what's the best way to deliver it? Do I need to provide one-off jobs to generate a CSV and send it to them via email? Do I need to give them an S3 bucket to load things from? It all depends on the level of sophistication of the people who are going to be consuming the data. Maybe it's just a dashboard format and people just need to be able to look at the data.
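To make that concrete, here is a minimal sketch of the "just let them query it" delivery path on this kind of lakehouse, using the trino Python client against an Iceberg catalog exposed through Trino. The host, catalog, schema, and table names are all placeholders for illustration, not anything from my actual platform.

```python
# Minimal sketch: handing a consumer a SQL connection to the lakehouse instead
# of a one-off CSV. Assumes the `trino` Python client and an Iceberg catalog
# configured in Trino; all names below are placeholders.
import trino

conn = trino.dbapi.connect(
    host="trino.example.internal",  # placeholder Trino coordinator
    port=8080,
    user="analyst",
    catalog="lake",       # Iceberg catalog configured in Trino
    schema="analytics",
)

cur = conn.cursor()
cur.execute(
    "SELECT order_date, sum(order_total) AS revenue "
    "FROM orders GROUP BY 1 ORDER BY 1"
)
for order_date, revenue in cur.fetchall():
    print(order_date, revenue)
```

The same connection details can just as easily be dropped into a BI tool or a notebook, which is part of what makes a SQL interface such a convenient lowest common denominator for delivery.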
One of the challenges of that handoff is also when you need to be considerate of how that data is going to be used after you present it, because maybe you want to give a visual representation or give a way for somebody to access the data, but you don't want them to then exfiltrate it into another system via a CSV export, for instance. And that's where you start getting into questions of governance, who has access to what, and being able to audit data access.
But harking back to one of the episodes I did a while ago on the idea of shadow IT in the data industry, the best way to prevent people from taking data out of the context in which you want it to be presented and bringing it into other tools is to reduce any friction or pain that they're experiencing in accessing the data in the way that you have presented it to them. Because if your options are the best and most accessible options for the people who are viewing that data, then they're not going to want to bring it out of that system, because you are giving them the best experience. And so that's where I am right now: figuring out what that best experience is for everybody,
what are the requirements, and how do I then manage that. A lot of the complexity comes in with the elements of interoperability across the overall platform. And I'm using the term platform deliberately, because I am aiming for a holistic experience for end users versus just a number of point solutions where somebody can maybe plug something in and do their thing and then somebody else can plug something in and do their own thing. I want to figure out the minimal set of interfaces that I need to build and support to address the widest variety of needs while still being extensible for the case where somebody has a bespoke requirement
that I need to be able to fulfill. How do I make sure that that doesn't add an undue amount of maintenance burden on myself and my team while still being able to deliver on that request? In general, the way that that degree of interoperability is managed is through adoption of open standards that everybody has agreed upon. SQL is probably the longest-lived one of those open standards. To be fair, there are multiple different varieties of it, but at its core, SQL is understandable. So if you have a means of using SQL to query data, that is going to make it easy for a lot of different tools and people to do their own exploration and self-serve. dbt has
capitalized on that as a means of building their product: being able to say, okay, the majority of structured data sources are going to be addressable by SQL, so we're going to build a tool that allows people to build a full engineering development and delivery flow for that SQL, manage the transformations through our tool, and then add a lot of niceties around it in terms of lineage, data documentation, etcetera.
Because they have invested so much in the ease of use of the tool, as well as doing a lot of advocacy in the community to drive adoption, that has led to a number of other tools integrating with them. So they have become a de facto interface for managing transformations in a warehouse or warehouse-like context. There are tools such as Preset that are building integrations with dbt around the data types and the metadata that it generates.
There are entire products that are built on top of the metadata that dbt generates; Lightdash is one of those. That has given dbt, as a tool and as a company, a lot of momentum and a lot of inertia, which makes it more difficult for other people who are maybe targeting a similar use case, but want to add a step change to some of the capabilities around that tool, to actually break in and overtake that market. The most notable one that I'm aware of so far is SQLMesh, who are adding a compatibility layer with dbt so that you can run SQLMesh on top of a dbt project and execute the transformations without having to make any code changes. That solves for the adoption step of "I just want to be able to try out this tool," but then there is still the challenge they're going to have to overcome of building up the set of integrations with the broader ecosystem that dbt has benefited from. That makes the road to adoption, and the road to being a viable competitor, a lot longer.
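To make that "weight of the ecosystem" point a bit more concrete, a lot of those integrations boil down to reading the artifacts that dbt already writes out. This is a rough illustration rather than any particular vendor's code; the layout shown is the documented dbt manifest artifact, but the exact keys should be verified against your dbt version.

```python
# Sketch: extracting model-level lineage from the manifest.json that
# `dbt compile` / `dbt run` write to the target/ directory. Treat the exact
# key names as an assumption to check against your dbt version.
import json

with open("target/manifest.json") as f:
    manifest = json.load(f)

for unique_id, node in manifest["nodes"].items():
    if node.get("resource_type") != "model":
        continue
    upstream = node.get("depends_on", {}).get("nodes", [])
    print(f"{unique_id} depends on {upstream}")
```

Once a whole ecosystem of tools has been written against artifacts like this, a newcomer has to either emit compatible artifacts or rebuild all of those integrations from scratch.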
Another tool that has benefited from this status of being a de facto standard and a reference implementation is Airflow. They have been around for a long enough time, and have been adopted by a large enough user base, that if you, as a new tool vendor, build your initial integration with Airflow, then you have a large enough addressable market that it is worth the time of building that integration, but it's not necessarily worth it to build that same integration for Dagster or Prefect or one of the other orchestration systems that are out there. So as you are building your own platforms and doing this tool selection, it's worth considering: am I choosing something that is going to benefit from the existing weight of the community integrations that are available for it? If not, is the value that it provides worth the additional complexity of me having to go out and build those integrations, or work with other vendors to build them?
Developing event-driven pipelines is going to be a lot easier. Meet Memphis Functions. Memphis Functions enable developers and data engineers to build an organizational toolbox of functions to process, transform, and enrich ingested events on the fly in a serverless manner, using AWS Lambda syntax, without boilerplate. It handles the orchestration, error handling, and infrastructure, in almost any language, including Go, Python, JavaScript, .NET, Java, SQL, and more.
Go to dataengineeringpodcast.com/memphis today to get started. Another element of that integration question is the challenge that a lot of the different tools in the stack want to be the single source of truth for certain aspects of your platform. One of the ones that I've been dealing with recently is the question of access control, role definition,
and security and auditability of the data. Because I am building with a disaggregated database engine, where the storage and the query access are separated, there are at least two different locations where those definitions can be stored. One camp will say that because the storage layer is the lowest level and will potentially have multiple different points of access (maybe I'm using Trino to query it with SQL, but I may also be accessing the data in S3 directly with some Python tool, or using something like Spark or Flink for other processing approaches beyond just SQL), all of that access information needs to live in the storage layer. That's the approach that companies like Tabular are taking with the Iceberg table format: as long as you have the storage layer secured, it doesn't matter what all of the other layers on top of it might have to say about access control, because it's going to be enforced at the table level.
Then you go the next level up, and engines like Trino, or the Galaxy platform from Starburst, say: we want to own the access control, because we have more visibility into the specifics of the queries and we can do things like row-level filtering in the role definitions. So we should be the ones to own that role-based access control information, or attribute-based access control.
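As a concrete sketch of what "the query engine owns access control" can look like, here is a small script that renders Trino's file-based access control rules from a central role definition, including a row-level filter. The rule shape follows Trino's documented file-based access control as I understand it, so double-check the keys against your Trino version; the role, user, and table names are placeholders.

```python
# Sketch: generating Trino file-based access control rules (rules.json) from a
# central role definition, including a row-level filter. Verify the rule keys
# against the Trino docs for your version; all names are placeholders.
import json

roles = {
    "regional_analyst": {
        "users": ["alice", "bob"],
        "tables": [
            {"catalog": "lake", "schema": "analytics", "table": "orders",
             "filter": "region = 'EMEA'"},
        ],
    },
}

table_rules = []
for role_name, spec in roles.items():
    for user in spec["users"]:
        for t in spec["tables"]:
            table_rules.append({
                "user": user,
                "catalog": t["catalog"],
                "schema": t["schema"],
                "table": t["table"],
                "privileges": ["SELECT"],
                "filter": t["filter"],  # row-level filter enforced by Trino
            })

with open("rules.json", "w") as f:
    json.dump({"tables": table_rules}, f, indent=2)
```

The catch, of course, is that the same central role definition still has to be translated separately for the storage layer and the BI tool, which is exactly the duplication I keep running into.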
That's definitely another viable option, but then you go another layer up the stack into business intelligence, and maybe there's even more granularity available, because you can say: this person has access to this particular dataset with this row-level filtering, and they can view these visualizations on that data, but maybe they can't write their own queries against that data. So there's the challenge of, okay, do I have to define the roles in three different places? Do I have to have slightly different roles across all three of those places? How do I align them to say that this user in the business intelligence tool is the same as this user in the query engine and the same as this user in the storage layer, and make sure that they are getting a cohesive experience across those boundaries? That's particularly tricky when you have tools that maybe don't want to, or are disincentivized to, do that
propagation of role information. For instance, if I use Tabular and I define a role that will grant read access to this subset of tables but no other visibility into the rest of the dataset, how do I then reflect that information to the query layer, so that it knows who that user is and can enforce those permissions, and then all the way up to the business intelligence layer, to say: from the storage layer, these are the permissions that you have in the UI. That brings in the need to have some manner of single sign-on and a single source of identity for people across all of those boundaries. Beyond the question of permissions, there's another set of information that is
disjoint and wants to be owned by different components within the stack. And that is the question of data flow, data processing, data lineage, where each tool maybe has a certain view of what the lineage graph looks like or what the processing looks like.
But you don't necessarily have the complete end-to-end view of a piece of data from where it lands in an application database all the way through to where it's being presented in a business intelligence dashboard or incorporated as a feature in a machine learning model training workflow. For instance, dbt has the table-level lineage of the transformations that it manages.
Tools like Airflow and Dagster have the view of the lineage of all of the tasks that they are responsible for executing, but they don't necessarily have information about out-of-band transformations that are happening because some analyst is running ad hoc queries in the data warehouse. They maybe don't have visibility into the data as it's landing in an application database, and they only see the data once it lands in the warehouse. Or maybe they can see the
data integration step with an Airbyte or a Fivetran, and can say: I know this table in the application database feeds into this table in the warehouse, and then these dbt flows happen, but they lose sight of the dashboards that are being generated from that data. So, again, this is the question of open protocols and interoperability. Tools like OpenLineage are designed to help address that, where you fire events that can then be constituted into a more cohesive lineage graph.
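For a sense of what that looks like in practice, this is a minimal sketch using the openlineage-python client to emit a completion event for an out-of-band job, the kind of event an orchestrator would normally emit on your behalf. The endpoint, namespaces, and dataset names are placeholders, and the exact client API should be checked against the OpenLineage documentation.

```python
# Sketch: emitting an OpenLineage run event for an ad hoc job so that it shows
# up in the same lineage graph as orchestrated work. Assumes the
# openlineage-python client; endpoint and names are placeholders.
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://lineage.example.internal:5000")

event = RunEvent(
    eventType=RunState.COMPLETE,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=Run(runId=str(uuid4())),
    job=Job(namespace="ad_hoc", name="monthly_revenue_export"),
    producer="https://example.internal/ad-hoc-scripts",
    inputs=[Dataset(namespace="trino://lake", name="analytics.orders")],
    outputs=[Dataset(namespace="s3://exports", name="monthly_revenue.csv")],
)
client.emit(event)
```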
And then you also have systems such as metadata platforms that are designed to be more holistic views of the entire ecosystem and incorporate things like data discovery and data governance, which gives you a single place to view all of that information. But then you're back to that question of integration: okay, this can store and convey all of this information, but how do I get all of this information into it? Each of those tools is going to have different means of pushing or pulling data, and you have to make sure, as the platform designer and operator, that those data flows are also happening. So it's an additional set of DAGs that you need to make sure are running and are reliable. It's useful, but it's also an additional burden.
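To give a flavor of what one of those extra DAGs might look like, here is a hedged sketch of a Dagster asset that reads table and column metadata out of Trino's information_schema and pushes it to a metadata catalog over HTTP. The @asset decorator and the trino client are real; the catalog endpoint and payload are hypothetical stand-ins for whatever SDK or API your catalog of choice actually expects.

```python
# Sketch: one of the "additional DAGs" that keeps a metadata catalog in sync.
# The catalog URL and payload shape are hypothetical; real catalogs
# (OpenMetadata, DataHub, etc.) each have their own SDKs and formats.
import requests
import trino
from dagster import asset

CATALOG_URL = "http://catalog.example.internal/api/tables"  # hypothetical endpoint


@asset
def sync_lake_schema_to_catalog() -> None:
    conn = trino.dbapi.connect(
        host="trino.example.internal", port=8080, user="metadata-sync",
        catalog="lake", schema="information_schema",
    )
    cur = conn.cursor()
    cur.execute(
        "SELECT table_schema, table_name, column_name, data_type "
        "FROM columns WHERE table_schema <> 'information_schema'"
    )
    tables = {}
    for schema, table, column, dtype in cur.fetchall():
        tables.setdefault(f"{schema}.{table}", []).append(
            {"name": column, "type": dtype}
        )
    for fqn, columns in tables.items():
        # Illustrative push only; swap in the catalog's real client here.
        requests.put(CATALOG_URL, json={"table": fqn, "columns": columns}, timeout=30)
```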
So these are all things that I've been dealing with recently. And then, in the metadata catalog situation, even if you do manage to feed all of your data into it, it is useful as a means of discovery or a means of keeping tabs on what's happening, but it also feeds back into: okay, if I want to use this as my single source of truth, how do I then propagate that truth back into other systems? That's where you start to get into questions of things like active metadata, and then you have another set of integrations, running in another direction, where if I say, okay, I have my metadata catalog, and this is my source of truth for role information and who can access what, now I need to push that back down into the storage layer and into the query engine and into the business intelligence dashboards, and I need to make sure that all of those integrations are reliable and that there are appropriate mappings between the different concepts throughout the different systems. Data lakes are notoriously complex.
For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte scale SQL analytics fast at a fraction of the cost of traditional methods so that you can meet all of your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and DoorDash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises.
And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake, and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
So this is definitely one of the benefits that fully vertically integrated platforms have: you don't have to fight with all of those different layers of integration. But the problem there is that you have to rely on the integrations that have already been built, and maybe you're constrained by what that vertically integrated platform offers. You don't necessarily have the means of extending it into areas that the platform developers haven't had the time to implement or haven't been exposed to as a necessity. So it's the continual story of software design and software development: how do we build these systems so that they are extensible and can be integrated, while also making sure that the product you're building doesn't just get crushed under the weight of having to maintain all of these different point solutions?
I think that we're definitely iterating towards these community standards. Tools like OpenLineage, and the work that the OpenMetadata folks are doing with their schema-first approach to metadata, are interesting. The work that dbt has done to become that de facto standard for how your transformations are represented, so that other tools can build on top of it, is also valuable. So it's great to see the direction that the community has taken on all of these fronts, but I do think that we are definitely still not at the point where we have a lot of the answers fully baked. I think that everybody who is investing in this ecosystem, everybody who is building these tools and using these tools and giving feedback to the tool vendors, is helping to bring us to a better place.
But as somebody who is trying to integrate so many different pieces, figure out what that holistic platform looks like, and build it in a way that I can maintain with a small team while still having the flexibility required to address a wide set of audiences, it's definitely still a challenge, but one that I've been enjoying having the opportunity to explore and invest in. Now, as I have been iterating on these challenges and thinking through how best to build that holistic platform, something that is going to be enjoyable and usable by a number of different people, the next main architectural component that I've been
starting to work towards is that metadata platform, so that I can have that more cross-cutting view and improve the data discovery story for people who aren't part of the engineering team, who just want to see: what data do you have, where is it, how do I access it. So that's where I'm going to be doing some of my next work: picking the metadata platform, getting it integrated, getting all of the data flows propagated into that tool so that we can see how everything is flowing, and then starting to integrate single sign-on as a means of identity management across the different layers. Then, when somebody comes through the metadata platform to do data discovery and says, okay, here's the dataset that I want to explore, or here's the chart that I want to view, I want to have a simple means of clicking a button and jumping them into the experience that they're requesting. Or, if they say, okay, I need to query this database, I want to give them the prefilled set of credentials or the prefilled client connection to run those queries with their tool of choice, whether that be something like Tableau or a SQL command line, etcetera.
And as I have been doing this work, the most interesting or innovative or unexpected aspects have definitely been these integration boundaries of saying: okay, this is one tool that I have, this is the other tool that I have, I would like to be able to use them together in this manner; now, how do I go about doing that? So I am definitely happy that the Python language has become one of the most widely adopted ecosystems for data engineering, because it does simplify some of that work: you can pretty safely assume that there's going to be a Python library that does at least 80% of what you're trying to do. For instance, the OpenMetadata platform has a Python library for doing that metadata ingestion, so even if they don't have an out-of-the-box solution for ingesting metadata from the tool that you're using, they do have a Python client that you can use to provide metadata to them and propagate information from a system that they haven't already addressed. I do think that is another good trend in the ecosystem: providing a good set of software clients for integrating with a tool, even from systems that the tool's developers didn't already plan to integrate with. So while it does add a bit of extra burden for the people who are trying to use those systems, there is at least a
path to success. I'd also say that's probably the most challenging lesson that I've learned as well: figuring out which points of integration are worth investing in and which ones I don't have to invest in right now, where maybe I can wait to see if the ecosystem around that tool grows up. SQLMesh is a tool that I've been keeping an eye on. When I first came across it, it didn't have support for Trino, so that was an obvious no for being able to use it. Now it does, but it doesn't yet have an integration with Dagster. And, as I was pointing to earlier, the existing weight of integration with dbt gives me a bit of pause: even if I do get it working with Trino and Dagster, what about all of the other prebuilt integrations that I am not going to be able to take advantage of because I have decided to use a newer tool that hasn't been as widely adopted and integrated? So that is definitely a challenging aspect as well. If you have found my musings useful or informative, or if they have inspired something that you would like to discuss further, I'm definitely always happy to take suggestions, bring people on the show, or
take some feedback. So for anybody who wants to get in touch with me, I'll add my contact information to the show notes, but the best way is through dataengineeringpodcast.com, which has links to how you can find me.
And for the final question of what I see as being the biggest gap in the tooling or technology for data management today, I definitely think it's that single source of truth for identity and access across the data stack: figuring out how to manage permissions and roles across a variety of tools without having to build different integrations every single time.
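One shape that could take, purely as an illustration, is a shared policy service that every layer consults for authorization decisions. The sketch below queries Open Policy Agent's data API from Python; the policy package path and the input fields are my own assumptions about how you might model it, not an established standard.

```python
# Sketch: asking a shared Open Policy Agent instance "can this user read this
# table?" over OPA's HTTP data API. The package path and input shape are
# assumptions for illustration; the decision defaults to deny if undefined.
import requests

OPA_URL = "http://opa.example.internal:8181/v1/data/dataplatform/allow"  # placeholder


def can_read(user: str, catalog: str, schema: str, table: str) -> bool:
    body = {"input": {"user": user, "action": "read",
                      "resource": f"{catalog}.{schema}.{table}"}}
    resp = requests.post(OPA_URL, json=body, timeout=5)
    resp.raise_for_status()
    # OPA returns {"result": <decision>}; treat an undefined result as deny.
    return bool(resp.json().get("result", False))


# The storage layer, the query engine, and the BI tool could all call the same
# decision point instead of each keeping its own copy of the role definitions.
print(can_read("alice", "lake", "analytics", "orders"))
```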
So I definitely look forward to seeing more investment in that area, maybe even building on something like Open Policy Agent from the cloud native ecosystem, along those lines, and I'm happy to continue investigating that as well. So thank you for taking the time to listen to me. I hope it has been, if not informative, at least entertaining. I'm very thankful for being able to run this show and bring all of these ideas to everybody who listens. In honor of this being Thanksgiving week, I just wanted to share that gratitude with everybody who takes the time out of their life to pay attention to me and the people I bring on the show. So thank you again, and I hope you all have a good rest of your day.
Thank you for listening. Don't forget to check out our other shows: Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it.
Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.