From Data Chaos to Clarity - Unraveling Data Engineering with Blake Burch

Mar 07, 2024 | 20 min | Season 2, Ep. 12
Episode description

In this episode, we talk to Blake Burch, a software engineer with a keen focus on the often overlooked but crucial field of data engineering. Throughout the conversation, we explore the challenges and solutions in making data manipulation and infrastructure setup more accessible and secure for teams. Burch shares insights on the evolution of data tooling, the importance of data observability, and strategies for building trust within organizations through effective data management. We delve into the pivotal role of domain knowledge in data teams and discuss Shipyard's approach to simplifying data workflows for businesses.

Transcript

Welcome to Tectastic, the podcast that explores the cutting-edge world of technology and its impact on society. New breakthroughs and developments are revolutionizing the world around us, presenting exciting opportunities as well as complex challenges. We'll explore the big ideas and key players driving these transformations as we seek to understand the implications of these advancements for our lives, our communities, and our planet.

Join us on this journey of discovery and exploration as we navigate the fascinating and ever-evolving world of technology. This is Tectastic. It is so lovely to have you here. Great to be here. Thanks for having me. Yeah. So you've got an interesting product, it sounds like. You're focused on one of the red-headed stepchildren of the technology landscape, data engineering. And as a software engineer at my core and as a senior leader that's moved in that space, I know I neglected it.

We would give our teams tools that allowed them to go write code to move data as they needed, but we didn't make it easy for them. Not because we didn't want it to be easy for them, just because it was not front of mind. It wasn't the most important thing. And it sounds like that's what you're trying to solve. Yeah. It's a common issue in the space.

And it's been interesting to see, like, the cycle of the industry, even just over the past 4 years, where the kind of easier-to-use tools, the point-to-point solutions, started coming out that would help you just load the data somewhere, just transform the data, just send data from a warehouse out to your SaaS tools.

But we're starting to see kind of a shift of, okay, we don't wanna have 10 bazillion tools in our organization because it means we don't actually get any visibility into, like, how things are moving and, when this data set is created and transformed, what reports it's gonna tie to, what dashboards it's connected to, what sort of, like, machine learning models it might be connected to.

And so our big goal over at Shipyard is just trying to provide an all-in-one solution where you can connect each of these different touch points of your data together in a low-code fashion, but we still let you run your own code because inevitably you're gonna have to do that. We can only get 80% of the way there with low code, but for proprietary use cases, we can help make sure that those scripts are tied together with the other tools that are moving and manipulating your data.

Related to that, and it's really close to what you're doing, is the infrastructure, because so often the data movements and all that have a fairly large impact on your infrastructure. Yeah. So the way that we kind of think about the infrastructure on it is our goal is to abstract it entirely for companies. So we're not the best fit if you're trying to have things that are self-hosted on your own servers, but we make sure that everything is fully scalable.

It's all on AWS on our side, but we choose to be kind of, like, vendor and cloud agnostic. So for most of the templates that we have, we have all of your standard AWS storage and database things like S3 or Redshift. We have BigQuery and Google Cloud Storage, we have Microsoft SQL Server and Azure Blob Storage. And so it doesn't really matter the infrastructure that you end up using on your side.

It ends up being modular, because I found in a lot of my experience that data teams are oftentimes being put in charge of setting up infrastructure, and that's not their skill set. Like, they're not doing it super well. It's not super secure. Like, why not just offload that to someone else so they can focus on the real, like, problems that they're good at, like, solving problems with data. The reason I ask on that front is we're kind of in the middle of that space.

Well, it's not a security tool, but security is a symptom of the thing that we're trying to solve. Infrastructure is not the problem, but it's a symptom of the problem that we're trying to solve.

And so we're finding ourselves as this go-between for the technology solutions you have, the code you've written, the business requirements, because those change. And this is the thing I said about tech debt that I think actually ends up being true with data and everything else: tech debt is inevitable, and it's just a factor of time. It is nothing else. Right? Time causes people to leave your team, new vulnerabilities to be discovered, patterns to change, etcetera.

And part of that is also infrastructure. There's cheaper, better, faster ways of doing things. The problem is your solutions as a whole cannot change at the same rate that the entire industry and everything else is moving. And so our job has become like a time machine for the organization to look back and say, what have you done? Why did you do it? And can we now get you where you want to be today? And data is a place that we've ignored. To us, it's almost the infrastructure problem.

Do I become the in-between for whatever the hell you're doing with whatever technology you want? Yep. But those patterns change too. And the reality is, even, like, one of the companies I was at was just a data platform. We were just moving data between all the various logistics companies on Earth, the ocean carriers, the World Bank, everybody.

And the fundamental problem there is there is not a model where you can say a container is a container, a ship is a ship, or a carton is a carton. Like, it's all radically different. Yep. So you become the go-between layer that says, oh, when you said container, they mean whatever object. Right? You do have transforms. And that to me is largely what the data infrastructure layer at most companies ends up being: just a transform layer between different point solutions. I've got a CRM.

I've got a CMS. I've got a etcetera, etcetera. Yep. You hit the nail on the head. Like, a lot of the work on the data side is just trying to make sure that the information will line up so that it works for some other system or for some other process or some other report. But what I found is that, like, a lot of data teams don't actually have a way to track how their data is being used. So what inevitably happens?

Yeah. They change a column or they change, like, the structure of how the data gets sent. And oops, that broke 3 systems, but the data team didn't know that. Now they have, like, the business team coming and knocking on their door and trying to figure out, hey, what went wrong? Like, all this stuff is broken, and it's, oh, I didn't know. And so, like, that is something we're trying to solve. Like, trying to have the observability to know how each of these pieces is connected together.
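As an illustrative aside, knowing how the pieces connect can be pictured as a small dependency graph that is walked to find everything downstream of a changed step. This is a minimal sketch under stated assumptions, not Shipyard's implementation: the asset names, the `LINEAGE` mapping, and the `downstream` helper are all hypothetical.

```python
from collections import deque

# Hypothetical lineage map: each asset points to the assets that consume it.
LINEAGE = {
    "orders_table": ["revenue_report", "churn_model"],
    "revenue_report": ["exec_dashboard"],
    "churn_model": [],
    "exec_dashboard": [],
}


def downstream(graph, changed):
    """Return every asset reachable from `changed`, in breadth-first order."""
    seen = []
    queue = deque(graph.get(changed, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already recorded via another path
        seen.append(node)
        queue.extend(graph.get(node, []))
    return seen


# Everything that could break if the structure of orders_table changes.
affected = downstream(LINEAGE, "orders_table")
print(affected)
```

With a list like `affected` in hand, a team could review those assets or hold their refreshes until the change has been checked.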

So you could at least verify, hey, if this one step changes or breaks, look at all these connecting arrows, maybe you should look at these or maybe those should be prevented from running and refreshing until we, like, address the logic. I think just that lack of visibility seems to be a big problem on the data side. Huge problem. I'm gonna give you the worst example I ever saw. So, I'm trying to remember the technology that they used for everything.

They didn't really give a good way of migrating data around or moving it around. So they used, god, I can't remember the tool. It's really meant for taking and packaging up, like, a JavaScript file or something like that, and then executing it and then kind of throwing it away. And that's what they used for moving all data around, and they were doing that to do transforms on data to, like, take it and put it into a different system. And here's what went crazy.

One system at the beginning, let's say it was the system of record for a particular data object. It would publish it, then a bunch of other things would listen to that. They would do their own transform, stick it into their own system, and then publish back out their new objects. And then it would go all the way around until the original system of record was taking in data from another system that was just a mutated, bastardized form of their own original data. Okay.

It's starting the whole thing all over again. Cool. Sounds great. Yeah. I mean, that sort of stuff can happen, like, all the time. Like, an example I had from the last role, where I was leading data teams at a digital ad agency, is we worked with a lot of e-commerce brands. So, like, the Sephoras and Gaps of the world and everything else.

And we're getting, like, [unclear] files of all of their product information, and we're having to transform and format that data so that it can look good on Google and Bing and anything else like that. But sometimes the way that we were classifying stuff, they need to know what that looked like on their side. So we're sending them their data transformed back to them, the information. It just happens all the time. [unclear] Yeah.

So, yeah, I am very familiar with big data transforms and moving it around and that gross disaster that happens. And it's a profoundly difficult problem. The question that I have though is, because it's often treated as that red-headed stepchild of the tech industry, how do you get the people that are making the purchasing decisions to care, or did you take a different go-to-market strategy than trying to go straight to the top?

Yeah. We actually started to try and do bottoms-up, initially trying to get the actual practitioners that are going to be doing the data work, actually trying to move it and set things up, and, like, helping them understand, like, the way that data workflows could be set up in a way that is more sustainable, to where, like, you actually have some sort of error management. Like, you know, when something breaks, why it broke, and it prevents other stuff from happening.

And I think one of the recommendations that we give for a lot of clients that really seems to resonate is the whole issue, like what I was talking about earlier, where the business team comes knocking at your door. Like, that's such a common problem where the data team doesn't actually realize something was wrong until, like, 3 days later, and it just loses trust. And that is literally the worst thing that can happen in your organization, when nobody trusts the data.

Because at that point, even if it is 100% right, you don't have a way to prove to people that it's 100% right.

So the model that we typically recommend is, like, hey, as you're setting up these workflows on Shipyard, like, if something errors out, set up a, like, series of steps that will automatically generate, like, a Jira ticket to address the issue with, like, what the error message is, and then automatically send a Slack message to the affected teams, because then your process is not just deploying the data successfully. It's also successfully letting people know that it has errored.
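As an illustrative aside, the pattern described here (on error, open a ticket, then notify the affected teams) can be sketched as a step runner with failure hooks. This is a minimal sketch under stated assumptions: `run_step`, `open_ticket`, and `notify_teams` are hypothetical stand-ins that only record what they would send. They are not Shipyard, Jira, or Slack APIs, and a real setup would call those services from inside the hooks.

```python
from dataclasses import dataclass


@dataclass
class StepFailure:
    workflow: str
    step: str
    error: str


def run_step(workflow, step, func, on_failure_hooks):
    """Run one workflow step; on error, call every hook and return False."""
    try:
        func()
        return True
    except Exception as exc:
        failure = StepFailure(workflow, step, str(exc))
        for hook in on_failure_hooks:
            hook(failure)
        return False


# Stand-ins for a ticketing system and a chat channel (assumptions).
tickets = []
messages = []


def open_ticket(failure):
    # A real hook would create an issue that carries the error message.
    tickets.append(f"[{failure.workflow}] {failure.step} failed: {failure.error}")


def notify_teams(failure):
    # A real hook would message the teams that consume this data.
    messages.append(
        f"{failure.workflow}/{failure.step} has errored; the data team is working on it."
    )


def load_products():
    # Simulated failure in a hypothetical step.
    raise ValueError("column 'price' missing")


ok = run_step("product_feed", "load_products", load_products,
              [open_ticket, notify_teams])
print(ok, tickets, messages)
```

Keeping the hooks as plain callables means the same runner can be pointed at whichever ticketing and messaging tools a team already uses.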

It also lets them know you are currently working on it, and that is, like, the biggest way to build trust in there. And I think there's just enough teams that aren't doing that that it's a big opportunity. Yeah. This is a huge problem. The trust thing came up a lot at a former large company that moved a lot of furniture around the world. The worst example of that was we were giving them the feed. I know this is what's on the shelf.

I know this is where these things are at because in the system, this was just scanned. I can show you the event. And the team that was responsible for, like, the cascading series of things after that didn't trust it because they had a different number, and we couldn't show how those numbers related. And, like, the whole problem for us was, well, we're literally the source of record. I don't know what happens after that. I don't know how you're getting your number.

But this is the literal here's-the-gun, pointing-at-it-and-scanning-it system.

Yeah. I mean, it's no different than, like, the organizations where the data team will actively deliver a dashboard, and nobody uses that dashboard. They just click the download button, put it into Excel, and then make their own numbers and stuff off of it, which is so troublesome as a data team, because how are you supposed to, like, make sure that the right information is being shared among leadership and stuff like that? But I think it does stem from, like, a lack of trust or, like, I'd rather do it myself so I know how this number was created or how I got to it.

It's an interesting thing where I had the fortune at the last company that I was at, where the data team was actually responsible for doing 50% of, like, account management work at the same time, because we were trying to figure out, okay, what levers can you actually pull to, like, change bids and budgets, to create ads, to update audiences, to turn things on and off based on inventory.

And that, like, required some sort of domain knowledge along the way. But, like, the thing that I see for a lot of data teams is that they spend so much time and effort trying to get everything set up and looking nice and neat. And, like you said, things change all the time. Your requirements change. The data sets change. I almost find it to be better to try and, like, look at the business holistically and figure out, okay, what are the things that we can help move the needle on?

And rather than trying to set up 25 different data sources that have 30 columns each and make sure that it's all accurate, let's focus on the 2 that we can connect to this very specific action that the business can take off of it. And then that way, if things change, you're not having to worry about all this extraneous stuff. You can keep on adding more data, but we got to a point where the industry was all about big data, and everyone wanted all the data.

Well, everyone stored everything, and they're spending all this time and effort on making sure everything is accurate and clean, but, like, 20% of it gets used. So just focus on the 20% initially. That's such a good point. I don't even think that most companies understand. They've been told for long enough now that your data's valuable. Yep. And they're like, oh, data's valuable. It's all gold. Store it all away. Put it in a vault. Like, we'll figure out what to do with it later.

And it's like, no, no, no, no, no, no, some of your data is valuable. Some of it's probably very valuable to you. And it's not about storing it like gold in a vault where someday I'm gonna materialize this thing, and we're gonna be able to turn our entire business into a data company. It's like, no. No. It's valuable to you to act on. You have opportunities in market because of some signal that you're getting and you're not doing anything with.

You've got dead inventory in a warehouse that's costing you money somewhere that you could liquidate and whatever. You've got data that's valuable to you that might not be valuable to anybody else. In fact, it probably isn't, except for maybe your competitors. Right? The things that are important to you, your competitors would love to have that data. But otherwise, it's not. So to your point, find what is and act on it. Use it.

Yeah. Which ultimately requires data people to have some sort of domain knowledge or domain people to have some sort of data literacy. I mean, it goes both ways. Yeah. But I feel like it's not talked about nearly as much. And so instead, it's just hoarding as much data as possible. And then business people thinking, like, oh, we have data. We should be able to get this.

And it's like, well, no. Because we didn't really think through what this problem was, what we could do about it, and what data would relate to that in the first place. So there's so many parallels between that and, like, software dev. Right? If you have a software engineer that just goes off and builds something without any domain knowledge, they end up building the wrong thing. Same problem.

And the way that you always try to rectify that is either you put a translator in between them and the business, called a product manager, whose job is like, I have enough domain knowledge, and I know who to talk to, and I have enough engineering knowledge, and I have a team that I can translate it for, which I'm not a big fan of. Yeah. I find that that's just the game of telephone.

You've just inserted somebody that's gonna not be a good enough engineer and not know enough about the domain to be a good translator. Mhmm. The best thing to do is to literally get the software engineers in the room with the domain people and ask questions. Now that can be facilitated by a product manager. So to me, it looks a lot more like hackathons, frankly, where somebody comes in and they're like, I have this idea or, in the case of a business, I have this problem. I have this need.

I have this want. And a team being available to step forward and say, okay. We're gonna figure it out. Let's go figure out what we need. What data is available? Where's it coming from, etcetera? Yeah. In a big company that's hard, right, because how do you get those two parties together? You gotta structure that. That's what agile is actually supposed to be. That's what the manifesto is about. It's getting those two parties together. Right? But it's very difficult in practice. Yeah.

I would say another thing that I've seen be successful in some instances is just having your data team do some sort of, like, shadowing.

Like, it might not be having someone talk about the problem, but if a technical person sees what someone is doing on a day-to-day basis and then can ask questions out of that, they could, like, bring up new opportunities that the business person didn't even know to ask or didn't think was a problem, and that could result in some sort of business value. But that doesn't happen oftentimes. It's mostly, I want this. How do we get there?

And then everyone's just trying to, like, meet the demands that were initially given rather than trying to figure out root issues or other, like, smaller things that may have gone under the radar that can be fixed with the data. What we did at Nike worked really well, I thought. And it was kind of a combination of that hackathon type thing, because there were events where we would pull a bunch of people together and say, like, hey. You've all submitted all these great ideas.

We've got a bunch of teams of engineers. We've got a couple of weeks. We're gonna build some prototypes. Yeah. And having people whose whole job was just to go into the business and look and say, like, where are we running into problems? And the best way to do that was to identify areas that were having trouble. Yep. They're not meeting whatever quota they have, their KPIs or whatever it is. There's something wrong there.

Send in the consultants, and they're not consultants, really, so much as they're investigators. They're going in and going, what do we not know that we need to know? What's going on? And then they come back to the technology team and say, okay. What we found is this. Let's go engage and go deep with it. I mean, that's actually the methodology of, like, some of the very large, successful, long-term tech companies that have been around for, like, a century.

I'm thinking of a couple really, really big ones, but I can't remember their names at the moment. They're in, like, chemical engineering and in that type of space. That's effectively what they do: they've got people whose whole job is just go find the next biggest problem we need to solve and then tell us what it is, and then we will pull together an engineering team to go solve it.

That actually sounds like a really interesting role to try and see, like, how you could match with some of the data roles, like, an investigator of sorts.

Yeah. Like, if you had a tool, which it sounds like Shipyard might be, that gave them visibility into where the data is being used, what systems are using it, and how they're using it, and gave them the ability to have, like, versioning on it, because that's actually another issue, right? If it changes and I don't change to match it, that's when things break. That's a breaking change. Right?

But you need to be able to change your data because you've got some use case that needs it in a different form. So being able to maintain state on multiple versions for a period of time becomes a critical feature for most production systems. Yeah. That's a big reason why dbt ended up getting very popular in the space for data transformation, because it was really training data analysts to, hey, use SQL to make the tables and views that the business needs.

But by the way, this SQL is going to be stored on GitHub so that everything is version controlled. So if you have to update the definition of the table or view, we know exactly what it was so we can see when that change was made. And, like, very similarly, even for us, when you run stuff on the platform, we actually tie each log to a specific version number that you can then click and see, okay, what was the definition of that workflow for that version number?
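As an illustrative aside, tying each run log to a version number means the definition that produced any given run can be looked up, and two versions can be compared when runs start failing. This is a minimal sketch under stated assumptions: `DEFINITIONS`, `RUN_LOGS`, and both helpers are hypothetical and do not reflect how Shipyard or dbt store this data. Only the standard-library `difflib` call is real.

```python
import difflib

# Hypothetical store of workflow definitions, keyed by version number.
DEFINITIONS = {
    1: "steps:\n  - extract: orders\n  - load: warehouse.orders\n",
    2: "steps:\n  - extract: orders\n  - load: warehouse.orders_v2\n",
}

# Each run log records the version of the definition it ran against.
RUN_LOGS = [
    {"run_id": "a1", "version": 1, "status": "success"},
    {"run_id": "b2", "version": 2, "status": "failed"},
]


def definition_for(run_id):
    """Look up the workflow definition that a given run actually used."""
    log = next(entry for entry in RUN_LOGS if entry["run_id"] == run_id)
    return DEFINITIONS[log["version"]]


def diff_versions(old, new):
    """Return a unified diff between two versions of the definition."""
    return list(difflib.unified_diff(
        DEFINITIONS[old].splitlines(),
        DEFINITIONS[new].splitlines(),
        fromfile=f"v{old}",
        tofile=f"v{new}",
        lineterm="",
    ))


# Compare the last successful version with the one that failed.
for line in diff_versions(1, 2):
    print(line)
```

The changed `load` target shows up as one removed line and one added line, which is usually the first place to look when a run on the new version fails.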

So if all of a sudden you're getting a bunch of red x's, you can see, oh, here's the diff between those configuration files. This thing was changed; that's probably what's causing the issue. But, yeah, that's hugely important and starting to become more and more prevalent in the data space, which is great. Blake, it was a pleasure having you on. I'm gonna give you a chance to let everybody know where they could check out Shipyard. Yeah. So you can find us at shipyardapp.com.

And we also have a weekly newsletter called All Hands on Data. That's a Substack that you should subscribe to. It's a great resource for up-to-date information. Thank you so much. Appreciate it. And that's a wrap for this episode of Tectastic. I wanna thank you personally for joining us, and we'll see you next time. Until then, keep exploring and stay curious. Hey there, tech [unclear]. Is your team drowning in tech debt, and do you just wish you had a magic button to fix it?

I wanna introduce you to Vala AI, your tech debt hero. At Vala AI, we get it. You're busy. That's why we've made fixing tech challenges as easy as a click of a button. You don't need to be an engineer. We empower non-techies to conquer complex tech issues effortlessly. We understand you don't have time for tech headaches. Vala AI is here to lift that tech burden, making your tech debt disappear with a simple click. So, ready to say goodbye to tech troubles? Try Vala AI.

Your solutions are just a click away.

Transcript source: Provided by creator in RSS feed