Logical First, Physical Second: A Pragmatic Path to Trusted Data

Jan 25, 2026 · 41 min · Ep. 498

Episode description

Summary 
In this episode of the Data Engineering Podcast Jamie Knowles, Product Director for ER/Studio, talks about data architecture and its importance in driving business meaning. He discusses how data architecture should start with business meaning, not just physical schemas, and explores the pitfalls of jumping straight to physical designs. Jamie shares his practical definition of data architecture centered on shared semantic models that anchor transactional, analytical, and event-driven systems. The conversation covers strategies for evolving an architecture in tandem with delivery, including defining core concepts, aligning teams through governance, and treating the model as a living product. He also examines how generative AI can both help and harm data architecture, accelerating first drafts but amplifying risk without a human-approved ontology. Jamie emphasizes the importance of doing the hard work upfront to make meaning explicit, keeping models simple and business-aligned, and using tools and patterns to reuse that meaning everywhere. 

Announcements 
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • If you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks', and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest, we all need to Retool how we handle data requests.
  • You’re a developer who wants to innovate—instead, you’re stuck fixing bottlenecks and fighting legacy code. MongoDB can help. It’s a flexible, unified platform that’s built for developers, by developers. MongoDB is ACID compliant, Enterprise-ready, with the capabilities you need to ship AI apps—fast. That’s why so many of the Fortune 500 trust MongoDB with their most critical workloads. Ready to think outside rows and columns? Start building at MongoDB.com/Build
  • Composable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end-to-end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you $1,000 credit to migrate to Bruin Cloud.
  • Your host is Tobias Macey and today I'm interviewing Jamie Knowles about the impact that a well-developed data architecture (or lack thereof) has on data engineering work

Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by giving your definition of "data architecture" and what it encompasses?
  • How does the nuance change depending on the type of system you are designing? (e.g. data warehouse vs. transactional application database vs. event-driven streaming service)
  • In application teams that are large enough there is typically a software architect, but that work often ends up happening organically through trial and error. Who is the responsible party for designing and enforcing a proper data architecture?
  • There have been several generational shifts in approach to data warehouse projects in particular. What are some of the anti-patterns that crop up when there is no-one forming a strong opinion on the design/architecture of the warehouse?
  • The current stage is largely defined by the ELT pattern. What are some of the ways that workflow can encourage shortcuts?
  • Often the need for a proper architecture isn't felt until an organic architecture has developed. What are some of the ways that teams can short-circuit that pain and iterate toward a more sustainable design?
  • The common theme in all of the data architecture conversations that I've had is the need for business involvement. There is also a strong push for the business to just want the engineers to deliver data. What are some of the ways that AI utilities can help to accelerate delivery while also capturing business context?
  • For teams that are already neck deep in a messy architecture, what are the strategies and tactics that they need to start working toward today to get to a better data architecture?
  • What are the most interesting, innovative, or unexpected ways that you have seen teams approach the creation and implementation of their data architecture?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working in data architecture?
  • How do you see the introduction of AI at each stage of the data lifecycle changing the ways that teams think about their architectural needs?

Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Transcript

Tobias Macey

Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Composable data infrastructure is great until you spend all of your time gluing it back together. Bruin is an open source framework driven from the command line that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement.

Bruin allows you to build end to end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for dbt cloud customers, they'll give you a thousand dollar credit to migrate to Bruin Cloud.

You're a developer who wants to innovate. Instead, you're stuck fixing bottlenecks and fighting legacy code. MongoDB can help. It's a flexible, unified platform that's built for developers by developers. MongoDB is ACID compliant, enterprise ready with the capabilities you need to ship AI apps fast. That's why so many of the Fortune 500 trust MongoDB with their most critical workloads. Ready to think outside rows and columns? Start building at mongodb.com/build today.

Your host is Tobias Macey, and today I'm interviewing Jamie Knowles about the impact that a well-developed data architecture, or lack thereof, has on data engineering work. So, Jamie, can you start by introducing yourself?

Jamie Knowles

Hi. I'm Jamie Knowles. I'm product director for ER/Studio at Idera.

Tobias Macey

And do you remember how you first got started working in data?

Jamie Knowles

Uh-huh. Many years ago. Yeah. So I started back in the late nineties helping insurance companies and financial advisers move from good old-fashioned proprietary EDI pipelines to shared XML standards. I also did some interesting projects around the UK police force on defining common information models across 43 autonomous police forces, and the project at the time was really about stopping criminals slipping through the cracks between those police forces. It was driven by the Soham murders, so they wanted to be able to share data across different regions.

And that really got me fascinated by how meaning and structure enable scale, which really pulled me into going to work for the data modelling tools. I joined a company called Popkin Software, with System Architect, back in 1999, and I've been working on products like Erwin and, today, ER/Studio ever since.

Tobias Macey

And so in terms of the question of data architecture, it's one of those phrases that gets used a lot in various different contexts. Everyone has a different frame of reference or area of focus, and I'm wondering if you can start by just giving your definition of what it is and what it encompasses.

Jamie Knowles

Yeah. Sure. So I think architecture in general is about the big picture. It's not just individual designs. It's about creating shared patterns that give you consistency and interoperability across designs and projects. Now when I talk about data architecture, I mean the design of how data represents the business and how it's used over time.

So it starts really with business meaning. So focus on key concepts: how are they defined, how do they relate, the attributes of those concepts and business keys, how do we identify these things uniquely, which is a really important thing in analytics. That's the logical layer. It's defining the structure and meaning of information and how the business really works. So an example: someone might be interested in, hey, I want some customer revenue figures.

So in sales, a customer might be anyone with a signed contract, but in finance, it's only someone who's been invoiced. But in support, it's someone with an active case. Same words, different meanings, and we need to understand that. And then something like revenue: is it booked, billed, recognised or cash received? Each of these is valid, but only in the right context. So revenue itself isn't just a single thing. If we've got a B2C customer, it might be a point of sale, immediate and transactional. In a B2B context, it could be contracted, usage based, recognised over time or spread across multiple services. Same term but very different calculations.

And the only way to manage that at scale is to model the context explicitly so systems and teams know exactly which definition applies and when. So that's the sort of logical side of things, understanding the business, and then from there on we can realise those logical models as technical models, creating schemas for platforms, pipelines and sorting out governance. And I think the big thing that we see, the mistake that people make, is equating data architecture with physical models alone. So without a business-driven or logical foundation, data engineering might work in the short term, but it's not going to scale with trust or clarity.
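
Editor's sketch: the idea of modelling context explicitly can be illustrated in a few lines of Python. This is only an illustration under stated assumptions, not anything described in the episode: the `Party` fields and the `CUSTOMER_DEFINITIONS` registry are hypothetical names chosen to mirror the sales, finance and support examples above.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Party:
    """A person or organisation the business deals with (hypothetical fields)."""
    name: str
    has_signed_contract: bool = False
    has_been_invoiced: bool = False
    has_active_case: bool = False


# One term, several context-specific definitions, each written down explicitly.
CUSTOMER_DEFINITIONS: Dict[str, Callable[[Party], bool]] = {
    "sales": lambda p: p.has_signed_contract,
    "finance": lambda p: p.has_been_invoiced,
    "support": lambda p: p.has_active_case,
}


def is_customer(party: Party, context: str) -> bool:
    """Answer 'is this a customer?' only for a named business context."""
    try:
        rule = CUSTOMER_DEFINITIONS[context]
    except KeyError:
        # Refuse to guess when no definition has been agreed for the context.
        raise ValueError(f"No agreed definition of 'customer' for {context!r}")
    return rule(party)


acme = Party("Acme Ltd", has_signed_contract=True)
print(is_customer(acme, "sales"))    # True
print(is_customer(acme, "finance"))  # False
```

The point of the design is that the question cannot be asked without naming a context, so "same words, different meanings" shows up as an explicit parameter rather than as a hidden assumption in a query.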

Tobias Macey

And one of the other interesting aspects of data architecture is to your point of it is the bigger picture, but everyone's idea of what that picture is about changes and also the technology involved varies across the board.

And I'm curious what you see as some of the variance and nuance around that concept of data architecture depending on the type of system that you are designing and the role that it plays both within the larger technical ecosystem and within the overall organizational ecosystem, whether that's a data warehouse versus a transactional application database versus an event driven streaming service, etcetera?

Jamie Knowles

Yeah. That's a great question. And I think the nuance absolutely changes by system type, but that starting point shouldn't. A semantic model is the universal core of everything. That's where you've got to start. I mean, for transactional systems the model is usually tightly coupled to business processes. We could design the system in isolation, but having its data assets aligned with the broader business definitions means that interoperability and governance of those systems is easier.

For warehouses and analytical platforms, the focus shifts towards understanding the business at rest and over time. You're modelling how the organisation wants to analyse itself. So the logical model captures those shared business concepts and relationships across domains. And from that you deliberately derive dimensional models, vaults or other analytical structures depending on the questions being asked. And I think the key risk here is jumping, again, straight to those physical structures without agreeing on what a customer, product or transaction is and what it means.

And then for event-driven streaming services the model often describes behaviour and change rather than state. Events represent business facts that happen at a point in time. But again, the logical model helps define what those events mean, what guarantees they carry and how they relate back to core business entities. So I think you're getting the gist here. It's all about meaning.
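
Editor's sketch: one way to picture an event as a business fact tied back to core entities is shown below. The `OrderPlaced` event, its fields and the sample identifiers are hypothetical and not taken from the episode; the timezone check stands in for the kind of guarantee an event might carry.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Iterable


@dataclass(frozen=True)
class OrderPlaced:
    """Business fact: a customer placed an order at a point in time."""
    order_id: str          # business key of the Order entity
    customer_id: str       # business key of the Customer entity
    occurred_at: datetime  # when the fact happened in the business

    def __post_init__(self) -> None:
        # A point in time is only unambiguous with a timezone attached.
        if self.occurred_at.tzinfo is None:
            raise ValueError("occurred_at must be timezone-aware")


def orders_per_customer(events: Iterable[OrderPlaced]) -> Dict[str, int]:
    """Derive state (order counts) from a sequence of change events."""
    return dict(Counter(e.customer_id for e in events))


events = [
    OrderPlaced("O-1", "C-42", datetime(2026, 1, 5, 9, 0, tzinfo=timezone.utc)),
    OrderPlaced("O-2", "C-42", datetime(2026, 1, 6, 14, 30, tzinfo=timezone.utc)),
    OrderPlaced("O-3", "C-7", datetime(2026, 1, 6, 15, 0, tzinfo=timezone.utc)),
]
print(orders_per_customer(events))  # {'C-42': 2, 'C-7': 1}
```

Because the event carries the business keys of the entities it refers to, the same logical model can serve both the stream (behaviour and change) and whatever state is later derived from it.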

Tobias Macey

The other interesting wrinkle in this question of data architecture specifically, but also just architecture more broadly, is whose responsibility it is. So if you have larger engineering teams, you will generally have a dedicated architect role or might have a set of principal engineers who are also responsible for architecture. But if you're in a smaller team, it's generally going to be something that is a consensus-driven approach or just a first-mover-wins type of thing.

And so the architecture can end up being a bit more organic rather than deliberate. And I'm curious what your thoughts are in terms of who is ultimately responsible for ensuring that there is a cohesive and holistic architecture rather than it just being an exercise in organic growth?

Jamie Knowles

Yeah. I think someone's gotta take that role. Data architecture works best when it has clear ownership, not when it emerges accidentally. So someone's got to be appointed to that role. So appoint a data architect or a data architecture function; someone can do it on a part-time basis. But they're not gonna be working in isolation. They're really a facilitator.

Their role is to define and maintain the business-driven logical model, so create the shared standards and guardrails that span teams and systems. And every time I talk to any engineering team or data architects, the hardest part of the process is getting the business to define exactly what they want. So we've got to engage with those business domain experts. We've got to make them accountable for validating meaning and definitions. But the data architects are the facilitators, and usually they're skilled in sort of drawing that out and documenting it in a form that can be approved. So someone's got to take that role.

Tobias Macey

One of the other interesting aspects of data architecture when we're looking specifically at data warehouses is the generational shifts in the technical realities of what those warehouses are used for, what they're capable of, and who is responsible for them or who's working on them.

And as you have gone through these different generational shifts, where I think maybe the first real concrete architectural realities were oriented around the Inmon and Kimball style data warehouse architectures, where you're either doing a third normal form relational style for the entire warehouse or you're doing more of the star schema approach of dimensional modeling that Kimball popularized, and then you get into things like Data Vault, anchor modeling, etcetera.

But when you have a team who is just tasked with, hey, give me all the data I need for the business, and they don't necessarily have that background of data architecture or some of these more codified warehouse design patterns, what are some of the anti-patterns that typically crop up if you don't have that clear sense of ownership and everybody is just trying to do the next best task?

Jamie Knowles

My toes are curling in my shoes here. Yeah. Semantic entropy. You're just going to end up with something that's no longer scalable, it's no longer manageable, it's ungovernable, it's just a risk to the business. So you're going to see schema sprawl: every team models the same business concepts differently, and you end up with multiple versions of customer, order, revenue. It's all technically valid but semantically completely incompatible.

So yeah, we see a lot of companies just jumping straight to physical models. Teams just start building stars and marts and views optimised for short-term reporting. They don't put in the hard yards working on that shared logical foundation, so reuse and change is just painful later on. We also see pipeline-led design, where the shape of the warehouse is dictated by source systems or tooling rather than the business question. So the warehouse just becomes a mirror of operational systems instead of a business asset, that sort of build it and they'll come. And over time that just leads to fragile downstream dependencies where we've got undocumented assumptions and slow change. And most of these issues aren't caused by bad engineering; it's just the absence of a clear business-driven data architecture to guide your decisions up front.

Tobias Macey

One of the interesting things that you just brought up there is that idea of the warehouse being a reflection of the operational systems. And to some people, that might be considered a feature rather than a bug, where I know in my experience, I've had cases where somebody has gone into the data platform because they want to do some queries about the state of the application database to understand some application-specific reporting more than a specific business requirement, and they were surprised when they saw duplicative records because of the way that the raw layer was being loaded with an append-only format and trying to be immutable, or there being deleted records that weren't deleted in the data platform.

And I think one of the contributing factors of that is the growth of ELT as a pattern where you are just replicating that source system straight into the warehouse with the intent of it then being transformed into the more architecturally consistent structures. And what are some of the ways that you're seeing this ELT capability encourage some of those types of shortcuts or misconceptions about the purpose of that warehouse environment?
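
Editor's sketch: the surprise described above (duplicate and "undeleted" records in an append-only raw layer) usually goes away once the history is collapsed to current state. The snippet below is a minimal illustration under stated assumptions: the `RawRow` shape, the `load_seq` ordering column and the `is_deleted` marker are hypothetical, and it assumes the loader records deletes as marker rows, which not every replication tool does.

```python
from typing import Dict, Iterable, NamedTuple, Optional


class RawRow(NamedTuple):
    """One append-only record in a hypothetical raw layer."""
    key: str                 # primary key in the source system
    load_seq: int            # monotonically increasing load order
    is_deleted: bool         # delete marker, assumed to be supplied by the loader
    payload: Optional[dict]  # column values; None for delete markers


def current_state(rows: Iterable[RawRow]) -> Dict[str, dict]:
    """Collapse append-only history into the latest live row per key."""
    latest: Dict[str, RawRow] = {}
    for row in rows:
        seen = latest.get(row.key)
        if seen is None or row.load_seq > seen.load_seq:
            latest[row.key] = row
    # Keys whose most recent record is a delete marker are dropped.
    return {k: r.payload for k, r in latest.items() if not r.is_deleted}


raw = [
    RawRow("A", 1, False, {"status": "new"}),
    RawRow("B", 2, False, {"status": "new"}),
    RawRow("A", 3, False, {"status": "paid"}),
    RawRow("B", 4, True, None),
]
print(current_state(raw))  # {'A': {'status': 'paid'}}
```

The raw layer stays immutable; the "what does the application look like right now" view is a derived structure with its own stated meaning.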

Jamie Knowles

Yeah. I mean, ELT makes it easy to defer thinking. Your data lands cheaply and quickly. You tend to load everything and figure out meaning later. So modeling is postponed, business definitions are implied in SQL, and transformations become the de facto documentation. So you're just going to see lots of shortcuts, like designing directly in physical tables and letting source system structure dictate analytics. It works great early on, but semantic drift occurs and change then just becomes really expensive later on. So without that clear logical model upfront, ELT optimizes for speed today but at the cost of understanding tomorrow. Simple as that.

Tobias Macey

And for teams particularly who don't necessarily have that strong opinionated architect and they are using that ELT pattern for the benefits that it does provide, but they don't necessarily understand the bigger picture of how to wrangle the organization into actually contributing their opinions about the business semantics. What are some of the ways that you're seeing teams address some of that challenge of, I really need to deliver something quickly, I'm trying to use an agile approach to my warehouse design, but I also do understand that these are going to be short-term wins at the expense of long-term pain, and just some of the ways that teams should be structuring some of that, thinking about the units of work in terms of that agile delivery flow?

Jamie Knowles

Yeah, I think we hear a lot of objections, sort of, yeah, we're not going to make any progress until we've done all this sort of enormous architectural program. And so that's a big misconception. We don't have to architect the world before we can start. We've got to be pragmatic. The goal isn't a perfect model upfront. It's just to introduce enough business-driven structure to slow down that semantic entropy as the systems grow.

So teams can start small, model what's actually being used, and let the architecture evolve alongside delivery. So picking a handful of high-value business concepts is probably a good place to start. Define them properly in a shared logical model (customer, product, order, revenue), make those definitions explicit and agreed, and then map existing tables and pipelines to them. And then as you go further you can start pushing standardisation and naming conventions, conformed dimensions and reusable patterns, and then enforce them through lightweight governance in the delivery workflow, like design reviews against the model. It's about building a business-driven logical backbone.
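
Editor's sketch: a "design review against the model" can start as something very small, such as a check that every physical column is mapped to an agreed business concept. Everything named below (the concept definitions, the table and column names, the `review` function) is hypothetical and only meant to illustrate the shape of such a check.

```python
from typing import Dict, Iterable, List

# A handful of agreed business concepts (definitions are placeholders).
LOGICAL_MODEL: Dict[str, str] = {
    "Customer": "A party with a signed contract (sales context).",
    "Order": "A confirmed request to buy one or more products.",
    "Revenue": "Recognised revenue, as agreed with finance.",
}

# Physical column -> business concept it is supposed to realise.
COLUMN_MAPPINGS: Dict[str, str] = {
    "crm.accounts.account_id": "Customer",
    "shop.orders.order_id": "Order",
    "dw.fct_sales.amount": "Revenue",
    "dw.fct_sales.cust_type": "CustomerSegment",  # concept not defined yet
}


def review(columns: Iterable[str],
           mappings: Dict[str, str],
           model: Dict[str, str]) -> List[str]:
    """Return findings for columns with no agreed business meaning."""
    findings: List[str] = []
    for column in columns:
        concept = mappings.get(column)
        if concept is None:
            findings.append(f"{column}: not mapped to any business concept")
        elif concept not in model:
            findings.append(f"{column}: mapped to undefined concept '{concept}'")
    return findings


proposed = list(COLUMN_MAPPINGS) + ["dw.fct_sales.flag_x"]
for finding in review(proposed, COLUMN_MAPPINGS, LOGICAL_MODEL):
    print(finding)
```

Run as part of a pull request or pipeline check, a list of findings like this gives the architect and the domain experts something concrete to resolve, without requiring the whole enterprise model to exist first.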

Tobias Macey

In terms of that question of business involvement, that can also be a point of difficulty in the development of that overall architecture because in order to be able to build a mirror of the business, you need to understand the business, but also you need the people in the business to be able to understand enough about the questions that you're asking to give you useful answers. And I've definitely read various books about methods for being able to do that or ways of framing those conversations.

But there's also the pressure of the business saying, hey. I just want the data. Give me the data. And the data team's being pushed to say, okay. Fine. I'll deliver something, and then nobody's happy.

And I'm curious what are some of the strategies that you see as far as teams being able to push back a little bit on that delivery pressure and help the business understand the need for that deeper dive on getting the overall semantic model of the business so that the engineering team can deliver the things that the business users actually want versus the things that they say they need.

Jamie Knowles

Yeah. Hearts and minds. I think this is gonna go all the way back to leadership. Leadership needs to recognize the importance of these semantic models, these knowledge models, these data models. And if they're not done properly, then they're gonna end up with, again, unscalable, unmanageable, ungovernable, risky outputs. So getting leadership on board is probably the first step, and then the business has got to be aware of their role in this. We need subject matter experts to be engaged and to provide their knowledge, and when people ask for outputs, they've got to be prepared to sort of work through it and give the proper information. But this is what data architects do. We've been doing this sort of since the dawn of time, building databases. That's what we're skilled at.

So get us involved, we'll get everybody nailed down and get the information in place, get the models built. But yeah, I think it's hearts and minds. We've got to get the leadership involved first and get them understanding. And data engineers themselves need to recognize the importance of this stuff. So when somebody comes along with a half-baked requirement, push back and say, we need this fleshed out properly, please. Can we engage with the data architect?

Tobias Macey

One of the other interesting aspects of where we find ourselves right now is given that question of translating between the business user and the engineer. It's often been an exercise in frustration if you don't have people who are trained to be able to bridge that divide.

And now with the introduction of generative AI and this natural language interface to virtually everything that you want to plug it into, that can either improve or cause greater problems in terms of that question of translation and being able to travel back and forth between those different modes of thinking and those problem domains.

And I'm wondering how you're seeing some of the ways that natural language interfaces or natural language query approaches for data models, data catalogs can be used to facilitate those conversations and help to surface some of the business requirements or the types of questions that are being asked by the business users and turn those into a concrete task that the engineering team can deliver on.

Jamie Knowles

Yeah. Danger. Danger, Will Robinson. I mean, this is semantic entropy again, and where things can get really, really dangerous. So I think that the process and building blocks don't change. We've still got to work out these semantic models, these business models, these data models. There's a belief that AI can just do it all for you, and really that's not reality, and I'm not even sure it ever will be reality. So AI is great at accelerating first-cut business models and generating code, but humans still have to refine and validate them.

As you say, a lot of the tools have now got AI built in. ER/Studio, for instance, has got some great AI tools that sort of kick-start the creation of models and will help sort of suggest structures, but the human being still has to review them, validate them and understand them in the context of this organisation. Make sure they balance against the sort of polysemes, synonyms, homonyms of the organisation. Make sure all the information is there.

And I think on the other end of the spectrum, the AI doing the analyses is where there's even more danger. So we're seeing AI tools built into BI tools so we can provide natural language queries. Some of them will even create signals sort of based on information and patterns that it finds. That has to be grounded in that accepted business meaning. If the AI doesn't understand what it's looking at, the data that it's looking at, then it's going to come back with garbage.

Likewise, when you give it a natural language query, it's the same thing that the human being faces: the AI has to understand what you mean by that question. Again, you're talking about customer: what type of customer? You're talking about revenue: what type of revenue is it? So it's the same old thing. And I think with AI now, you type your question into the AI, you get an answer back and you trust it. You've got no idea whether it's sort of hallucinating behind the scenes. Perhaps in the past, you'd ask the human that question, and the human would go through the questioning process with you beforehand.

And so I think it's even more important now to have these models done, have these knowledge models in place.
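
Editor's sketch: the "questioning process" a human analyst would go through can be approximated by refusing to answer until ambiguous terms are pinned to an agreed sense. The term list, the function name and the wording of the prompts below are hypothetical; this is a sketch of the idea, not how any particular BI or AI tool works.

```python
import re
from typing import Dict, List, Set

# Terms with more than one agreed meaning, taken from the examples above.
AMBIGUOUS_TERMS: Dict[str, Set[str]] = {
    "customer": {"sales", "finance", "support"},
    "revenue": {"booked", "billed", "recognised", "cash received"},
}


def clarifications_needed(question: str, chosen: Dict[str, str]) -> List[str]:
    """List the follow-up questions to ask before a query is answered."""
    prompts: List[str] = []
    for term, senses in AMBIGUOUS_TERMS.items():
        # Match the term as a whole word, allowing a simple plural.
        mentioned = re.search(rf"\b{re.escape(term)}s?\b", question, re.IGNORECASE)
        if mentioned and chosen.get(term) not in senses:
            options = ", ".join(sorted(senses))
            prompts.append(f"Which '{term}' do you mean? Options: {options}")
    return prompts


question = "What was revenue per customer last quarter?"
for prompt in clarifications_needed(question, chosen={}):
    print(prompt)

# Once both terms are pinned to an agreed sense, nothing is left to clarify.
print(clarifications_needed(question, {"customer": "finance", "revenue": "billed"}))
```

The list of senses is exactly the human-approved model discussed in the conversation; the code only enforces that it gets consulted.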

Tobias Macey

If you lead a data team, you know this pain. Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data while keeping it all secure.

Type a prompt like build me a self-service reporting tool that lets teams query customer metrics from Databricks, and they get a production ready app with the permissions and governance built in. They can self serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com slash Retool today, that's r e t o o l, and see how other data teams are scaling self-service.

Because let's be honest, we all need to retool how we handle data requests.

One of the interesting approaches that I've seen with some of these AI native analytics tools is that it can actually help to also surface some of the types of questions that would have been lost by the engineering team because maybe they're just clicking around a dashboard, getting frustrated, giving up, or they're just generating a data export and doing whatever they want in Excel, and you lose that thread as far as what people are actually doing versus what you're delivering.

And having the record of the conversations that the business teams are having with those AI interfaces, of the questions that they're asking, being able to understand when they need to repeat things or drill down further or when they just give up, can also be a feedback mechanism to help get some of that implicit signal of what is the data that we actually need. The Dagster team, with their Compass model, actually bring that directly into Slack. And one of the ways that they've iterated towards that is if somebody asks for information about a dataset that doesn't exist yet, then it can actually automatically log a request to say, hey, I want this data, and that's another signal as far as, okay, well, this is a task that we need to take on.

And I'm just wondering what your thoughts are as far as some of that implicit information that these interfaces can provide and some of the ways that it can encourage laziness in the engineering team by saying, well, we don't need to do the architecture upfront. We'll just wait for people to stub their toes and bump their shins on things, and then we'll fix it afterwards.

Jamie Knowles

Yeah. I mean, you're just opening up to risk though, aren't you? I mean, somebody might just sort of make some assumptions based on what they're getting and make some pretty hefty business decisions that destroy the company. It's too risky. I think we've got to have the human interventions, the human beings clarifying every step of the workflows.

Tobias Macey

And one of the other aspects of the technical realities, going back to these generational shifts both in terms of the warehouse capabilities as well as the data models: as we do move to a world where AI becomes that mediating substrate, what are some of the ways that you see that potentially contributing to a new generational shift in terms of the warehouse modeling architecture, where maybe because the AI can bring in a broader amount of context faster than a human can, maybe it leads us to an architecture where the technical realities are paramount and they're less focused on the human-understandable semantics of the table relations, where maybe we're laying things out to be as efficient as possible on disk versus making sure that the dimensional structures are scrutable by a human?

Jamie Knowles

Yeah. I think it still all goes back to meaning. We've still got to have accepted meaning, and I think the humans still have to sign off on that. So I think we're gonna see some clever agentic workflows with humans and machines sort of working together and some interesting feedback loops. With ER/Studio, we're doing a lot of work on this, to sort of spew out RDF structures and SKOS linkages to the AI. We're looking at making ER/Studio sort of part of an MCP sort of platform so that it can contribute to agentic workflows. But I think we're a huge way away from just letting the machine do everything. The humans still have to be involved. We still have to have that human-curated and approved ontology, semantic model, business model, knowledge model, data model, whatever you want to call it, at the center. The AIs will be sort of helping add to it, build it, refine it, but the humans have still got to sign off.

Tobias Macey

To your point of the RDF schemas and the ontologies, knowledge graphs, that's a conversation that has resurfaced quite strongly recently because of the fact that a lot of these models do need that grounding in terms of the relational, in terms of actual relationships between things, not relational in terms of the relational algebra case.

And I'm wondering if maybe that's another trend that might push us into an evolutionary shift in terms of the database technologies or the storage engines, where maybe those graph semantics become a more embedded primitive in the engines that we're building versus just the two-dimensional relational tables.

And maybe that will encourage us to have more of that hybrid representation or a built-in capability to embed those knowledge graphs and those semantics into the two-dimensional data structures that we've been stuck with for decades.

Jamie Knowles

Love it. Love it. With some caveats. So yeah, an organization having a single accepted ontology: beautiful. Love it. But I think in practice, we're seeing multiple ontologies. At the moment, we're seeing the data governance folk creating their business glossaries and arranging those terms into models. So that's one ontology for the purpose of governance. We've got the data architects building their ontology in the form of data models, an enterprise data model. We've got the AI building out knowledge graphs. So we really need to combine them. Again, that's that semantic entropy. We just need one clear set of definitions with all of the homonyms, polysemes, etc. built into it, all human agreed, that can drive everything. I think that definitely is the future. We've just got to make it a single unified semantic model.

Tobias Macey

And that's one of the challenges that we've had, where we do have graph engines, we do have different data stores that are optimized for different use cases, but then you're in that polyglot persistence world where you have to synchronize data between two or three or five different systems to be able to get the answer to one question.

And there are various frameworks that can assist with that, but having it be a native capability of a single engine, I think, will encourage a more concerted investment in building those ontologies, while also recognizing that you can't have your cake and eat it too. There are limitations due to physics, etcetera, and computer science has yet to solve the problem of physics.

Jamie Knowles

Sure. Yeah. I mean, I think the tooling will help, and this notion of polyglot modeling is not new with data modeling tools. Having an enterprise logical model, great. Nice ERD. That effectively is your central ontology, and then the tools will generate that as physical models for traditional databases or JSON structures or whatever form you like, refactoring that knowledge model into different physical realizations.
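As a rough illustration of that idea, the sketch below defines one logical entity and renders it into two physical forms, a SQL table definition and a JSON Schema document. Everything in it is hypothetical and chosen for illustration: the `Entity` and `Attribute` classes, the `Employee` example, and the type mappings are assumptions, not how ER/Studio or any other modeling tool works internally.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribute:
    name: str
    logical_type: str      # must be a key in SQL_TYPES / JSON_TYPES below
    definition: str = ""   # the agreed business meaning
    required: bool = True


@dataclass(frozen=True)
class Entity:
    name: str
    definition: str
    key: str               # name of the identifying attribute
    attributes: tuple = ()


# Hypothetical mappings from logical types to two physical targets.
SQL_TYPES = {"string": "VARCHAR(255)", "integer": "INTEGER", "date": "DATE"}
JSON_TYPES = {
    "string": {"type": "string"},
    "integer": {"type": "integer"},
    "date": {"type": "string", "format": "date"},
}


def to_sql_ddl(entity: Entity) -> str:
    """Render the logical entity as a relational CREATE TABLE statement."""
    columns = []
    for attr in entity.attributes:
        not_null = " NOT NULL" if attr.required else ""
        primary = " PRIMARY KEY" if attr.name == entity.key else ""
        columns.append(
            f"  {attr.name} {SQL_TYPES[attr.logical_type]}{not_null}{primary}"
        )
    body = ",\n".join(columns)
    return f"CREATE TABLE {entity.name.lower()} (\n{body}\n);"


def to_json_schema(entity: Entity) -> dict:
    """Render the same logical entity as a JSON Schema object."""
    return {
        "title": entity.name,
        "description": entity.definition,
        "type": "object",
        "properties": {
            attr.name: {**JSON_TYPES[attr.logical_type],
                        "description": attr.definition}
            for attr in entity.attributes
        },
        "required": [a.name for a in entity.attributes if a.required],
    }


# One logical definition, written once.
employee = Entity(
    name="Employee",
    definition="A person engaged under an employment contract.",
    key="employee_id",
    attributes=(
        Attribute("employee_id", "integer", "Unique identifier for the employee."),
        Attribute("full_name", "string", "Legal name of the employee."),
        Attribute("hire_date", "date", "Date the current contract started."),
        Attribute("end_date", "date", "Date the contract ended, if any.",
                  required=False),
    ),
)

print(to_sql_ddl(employee))
print(json.dumps(to_json_schema(employee), indent=2))
```

The business definitions live only on the logical model; both physical renderings are derived from it, so a change to the meaning is made in one place.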

It's just a transformation exercise. But yeah, having one single central ontology that drives all this, great.

Tobias Macey

And to that point of tooling, when we were briefly referencing application database schemas, that also brings to mind things like ORMs, which can accelerate delivery but also lead to suboptimal data models and N+1 queries, where you say, "I've written my code. I understand what are the things that I want," and the database just becomes a reflection of the code rather than vice versa. And there's also the potential for that to happen, going back to that question of ELT, with a lot of the tooling that we have developed in the past five to ten years making it easier to treat transformations and sequences of SQL queries as a software engineering exercise.

And I'm wondering what you've seen as some of the ways that that pattern can also contribute to suboptimal data models and suboptimal data architectures, because it's easier to build and faster to deliver and doesn't necessarily require as much upfront design, where you say, "Oh, I just wanna be able to pull this thing out, so I'm gonna create a transformation," and then you end up with dozens of extraneous tables versus a more concise and clean core representation.
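For readers unfamiliar with the N+1 query pattern mentioned in this question, here is a minimal sketch using Python's built-in `sqlite3` module rather than an actual ORM. The `customer` and `purchase` tables and both function names are hypothetical. The first function issues one query for the parent rows and then one more per row, which is what lazily loaded ORM relationships often do behind the scenes; the second gets the same answer in a single joined query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE purchase (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(id),
    amount INTEGER NOT NULL
);
INSERT INTO customer VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Edsger');
INSERT INTO purchase VALUES (1, 1, 10), (2, 1, 15), (3, 2, 7);
""")


def totals_n_plus_one(conn):
    """One query for the customers, then one extra query per customer."""
    customers = conn.execute(
        "SELECT id, name FROM customer ORDER BY id"
    ).fetchall()
    queries = 1
    totals = {}
    for customer_id, name in customers:
        row = conn.execute(
            "SELECT COALESCE(SUM(amount), 0) FROM purchase WHERE customer_id = ?",
            (customer_id,),
        ).fetchone()
        queries += 1
        totals[name] = row[0]
    return totals, queries


def totals_single_join(conn):
    """The same result from a single query that joins and aggregates."""
    rows = conn.execute("""
        SELECT c.name, COALESCE(SUM(p.amount), 0)
        FROM customer AS c
        LEFT JOIN purchase AS p ON p.customer_id = c.id
        GROUP BY c.id, c.name
        ORDER BY c.id
    """).fetchall()
    return dict(rows), 1


print(totals_n_plus_one(conn))   # 3 customers -> 4 queries
print(totals_single_join(conn))  # same totals -> 1 query
```

With three customers the difference is four queries versus one; the gap grows linearly with the number of parent rows.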

Jamie Knowles

We focus on the models at the center, get the models right and then everything else falls out from it. I mean, same problem, isn't it? Absolutely.

Tobias Macey

I think the challenge is that by making things faster and easier, you're removing that point of friction that forces people to think.

Jamie Knowles

Bingo. Yeah. That's it. You've gotta do those hard yards up front, really. Think about what is it that we're talking about here.

Tobias Macey

And so for teams who have iterated themselves into a corner where they have delivered the things that the business has asked for, but not the things that they actually need and are at a point where every new request takes an exponentially greater amount of effort, what are some of the concrete strategies and tactics that are useful for them to be able to start working themselves into a more cohesive data architecture?

Jamie Knowles

Yeah. This is a hard one. I don't really have a good answer for it. We get a lot of this from our customers. So what do we do? Do we start afresh? Do we start something new in parallel? Or do we try and refactor what we've got? I think a lot of the time it might be down to, let's start afresh with something new and clean, take a new approach on it all. It's a tough one.

Tobias Macey

And with that ELT approach of taking the Databricks model of the bronze, silver, gold representations or the staging, intermediate, mart layers that the dbt tooling promotes, what are some ways that maybe you can use the facade pattern to say, "I'm going to keep the terminal nodes the same for now, but I'm going to rebuild those middle layers into being the actual dimensional warehouse," and use that as a way to iterate towards a more consistent and clean representation at that delivery layer while keeping the interfaces the same, until such point as you could say, "Okay, now I've rebuilt the middle layer, so now that top layer can either go away in pieces or we can actually build better or more comprehensive reporting on top of that"?
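A minimal sketch of the facade idea in this question, using `sqlite3` views to stand in for warehouse models. All table and view names (`stg_orders`, `dim_customer`, `fact_order`, `mart_revenue_by_customer`) are hypothetical. The mart view is the terminal node that consumers query: its name and columns stay fixed while the layer beneath it is rebuilt from a direct read of staging into a small dimensional model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Starting point: the mart reads straight from staging.
conn.executescript("""
CREATE TABLE stg_orders (order_id INTEGER, customer_name TEXT, amount INTEGER);
INSERT INTO stg_orders VALUES (1, 'Ada', 10), (2, 'Ada', 15), (3, 'Grace', 7);

CREATE VIEW mart_revenue_by_customer AS
SELECT customer_name, SUM(amount) AS revenue
FROM stg_orders
GROUP BY customer_name;
""")


def read_mart(conn):
    """What a downstream report sees; it only knows the mart's name and columns."""
    return conn.execute(
        "SELECT customer_name, revenue "
        "FROM mart_revenue_by_customer ORDER BY customer_name"
    ).fetchall()


before = read_mart(conn)

# Rebuild the middle layer as a dimension and a fact table, then
# re-point the mart at it without changing the mart's interface.
conn.executescript("""
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,
    customer_name TEXT UNIQUE
);
INSERT INTO dim_customer (customer_name)
SELECT DISTINCT customer_name FROM stg_orders ORDER BY customer_name;

CREATE TABLE fact_order AS
SELECT s.order_id, d.customer_key, s.amount
FROM stg_orders AS s
JOIN dim_customer AS d ON d.customer_name = s.customer_name;

DROP VIEW mart_revenue_by_customer;
CREATE VIEW mart_revenue_by_customer AS
SELECT d.customer_name, SUM(f.amount) AS revenue
FROM fact_order AS f
JOIN dim_customer AS d ON d.customer_key = f.customer_key
GROUP BY d.customer_name;
""")

after = read_mart(conn)
print(before == after)  # the consumer-facing result is unchanged
```

Comparing the mart's output before and after the swap, as the last lines do, is one simple way to check that the facade held while the internals changed.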

Jamie Knowles

Gosh. Yeah. There's a lot of options there, but I don't think you can do any of this until you've gone back to the basics and you've got that overarching logical model. I mean, semantically all those layers should be pretty identical. So start off with the process of identifying the core concepts, get them fleshed out, and then start reviewing the different layers against them.

Tobias Macey

And so in terms of that question of leadership buy-in, that could also be a bottleneck because leadership often has dozens of different competing priorities. And if you are an individual contributor, you're very passionate about the challenges that you're facing as far as trying to buy down technical debt in terms of your data models. What are some of the ways that you can try and bubble that up high enough to get the space to be able to actually do that work versus just the constant pressure of, "No, you just need to deliver the next thing"?

Jamie Knowles

Oh my gosh. This is a hard one. So yeah, I mean, we see it all the time. Business leaders say, "I'm only gonna be here in the job for two or three years, so let's just smash out answers as quickly as possible and then leave somebody else with the problem." It's whoever owns the long-term story, and we've just got to do a good sales job on them: look, if you want something that is scalable and manageable over time, then we've got to do the hard yards. If we just want to build something, use it for a while, and then tear it down and start again, great. But let's be aware of that right at the start.

Tobias Macey

In terms of your experience working in this space and in working with customers who are trying to understand how best to iterate towards and represent that data model, what are some of the most interesting or innovative or unexpected ways that you've seen teams build towards that definition and delivery of that comprehensive data architecture?

Jamie Knowles

Yeah. So I think the most effective teams treat data architecture as a living product. It's not a one-time design. So they start with small, business-owned logical models. They let it evolve alongside delivery. Some of them will anchor everything around shared semantic models that feed their warehouses, their metric layers, and even their AI tools. So meaning is defined once and reused everywhere. That's the core of it. I think what's unexpected is how lightweight this can be. It doesn't have to be something gargantuan and complicated. Simplicity is the key. The innovation isn't building huge fancy patterns, it's just clear business models that act as a connective tissue that lets everybody move fast without losing alignment. Simple as that.

Tobias Macey

And in your own experience of designing those data architectures, working in the space, working with teams who are trying to figure out how to capture and represent that data modeling and that core architecture, what are some of the most interesting or unexpected or challenging lessons that you learned in the process?

Jamie Knowles

Again, I think the hardest lesson is that it's not a technical problem. It's all about understanding communication, ownership, and incentives, about working with humans and understanding how the business works. So problems are caused less often by bad decisions than by missing decisions. When you're not making meaning explicit, systems drift. You've got to do the hard yards up front or everything else afterwards is just risky. And I think the most durable architectures aren't really the most sophisticated; they're the ones that stay simple, business aligned, and easy for people to understand.

Tobias Macey

And as the ecosystem continues to both embrace AI tooling as well as the growth of complexity in terms of the requirements, the number and variety of data sources, the investment in data as a core operational asset, as all of those things continue to grow and build, what are some of the ways that you're seeing teams address the needs of the moment, and what are some of the predictions that you have as far as how maybe not the nature of the work but the realities of the work will change going forward?

Jamie Knowles

Yeah. So AI is the key. AI is going to help us at all levels, but big old caveats. Again, as we said before, there are no silver bullets. We've still got to go through the same sort of process. AI will change how we think about architecture, but it exposes the gaps. It's going to help us with our design work and transformations and delivery, but it only works well when meaning is clear. If the definitions are fuzzy or inconsistent, AI is just going to move the problem around faster.

So I'm not giving you any real silver bullets here. Everything comes back to the same point. We've got to understand the meaning.

Tobias Macey

And are there any other aspects of this overall problem of data architecture, the ways that it impacts the actual day to day work of data engineers or just the overall challenge of getting that communication flow between the business and the technical teams that we didn't discuss yet that you'd like to cover before we close out the show?

Jamie Knowles

No. I think we've covered it. It's that, as you say, that process of working with the business, understanding how the business works, being able to have a conversation with the business. You've got to do the hard yards up front, agree the structure and the meaning of the data in play, involve your data architects certainly, and make sure that you're focusing on meaning. It's not just about physical data models. It's about understanding business meaning, making it explicit.

And, yeah, we're using the term knowledge modeling a lot, so model that knowledge.

Tobias Macey

I guess in terms of that capturing of business meaning, I briefly alluded before that there are various ways of framing that conversation. In your experience, what have you found to be some of the most effective ways of getting the business to actually engage and surface the detail that you need in a way that you can actually then take and translate it into the delivery of the technical requirements.

Jamie Knowles

Well, that's a nice easy one. Yeah. Using a good tool like ER/Studio. So creating nice data models as pretty pictures. Having these conversations in words is really hard. Pretty pictures make it a lot easier. So being able to lay something out as a diagram and show it to a business person, a logical data model, anybody should be able to understand it. And it's a really good sort of straw man to beat up. So use the tools, use the models, use your data architects.

Tobias Macey

And in terms of getting to that initial data model, what are some of the useful questions that you have found to be able to get to a point of having enough context to be able to even get to that first draft of a model?

Jamie Knowles

The traditional approach is conceptual models, logical models. Start off simple, rough out the concepts. Going through the different domains of the organization, working with HR. Okay, get all the HR guys in a room, talk me through what are the things that are important to HR, the notion of an employee and employment contracts. Let's rough out, make a list, what relates to what. It's good old fashioned data modeling that was born in the 70s and is still valid. Nothing's changed.

Tobias Macey

All right. Well, anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.

Jamie Knowles

I think it's gonna be leveraging AI, and it's a really interesting point in time. We're seeing a lot of our customers, as we're adding AI features into the tools, say, "Whoa, we can't have any kind of AI capabilities in the estate." It's like the old days of the Internet. Would you put your credit card details into the Internet? No, I wouldn't. Now look at us. So I think the biggest opportunity for us at the moment is AI. How can AI speed up these processes? As a tool vendor, this is something we're focusing on: how can AI make this stuff easier, quicker, etcetera.

Tobias Macey

Alright. Well, thank you very much for taking the time today to join me and share your thoughts and experiences of this overall challenge of data architecture, the importance that it plays, and some of the realities about how to actually get to a point of having that architecture in the first place versus just whatever happens to evolve from the series of requests that a team gets. So I appreciate all the time and energy that you're putting into helping organizations tackle that challenge, and I hope you enjoy the rest of your day.

Jamie Knowles

Pleasure. Thanks for having us.

Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. And the AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.

Transcript source: Provided by creator in RSS feed: download file