Practical Data Science: A Guide to Building the Technology Stack for Turning Data Lakes into Business Assets

May 12, 2026 · 24 min

Episode description

A comprehensive guide by Andreas François Vermeulen designed to help organizations convert raw data lakes into valuable business assets. It outlines a sophisticated Data Science Technology Stack that includes powerful processing and storage tools like Apache Spark, Kafka, and Cassandra, alongside programming languages such as R, Python, and Scala. The author presents a structured layered framework and the HORUS methodology to streamline data transformation through a hub-and-spoke approach. To ground these technical concepts, the text establishes a fictional corporate group, VKHCG, providing realistic datasets across sectors like logistics, media, and finance. This framework emphasizes moving beyond simple data wrangling toward a Center of Excellence model that ensures scalability and operational efficiency. Ultimately, the sources serve as both a theoretical roadmap and a practical manual for mastering the end-to-end data-to-knowledge cycle.

You can listen and download our episodes for free on more than 10 different platforms:
https://linktr.ee/cyber_security_summary

Get the Book now from Amazon:
https://www.amazon.com/Practical-Data-Science-Building-Technology/dp/1484230531?&linkCode=ll2&tag=cvthunderx-20&linkId=41e96f1f6d23f742302cb82466c28372&language=en_US&ref_=as_li_ss_tl

Discover our free courses in tech and cybersecurity, Start learning today:
https://linktr.ee/cybercode_academy

Transcript

Speaker 1

Welcome to another deep dive for you, the learner listening in today. I want you to just imagine standing on the edge of a massive, wildly turbulent.

Speaker 2

Ocean, like a really chaotic one.

Speaker 1

Yeah, exactly. But we're looking at a global landscape generating over forty zettabytes

Speaker 2

Of data, which is just an unfathomable number. It really is.

Speaker 1

And the modern business challenge isn't acquiring information anymore. The actual challenge is preventing your enterprise from drowning in these raw data swamps. You know, it's about figuring out how to build the industrial plumbing necessary to refine that total chaos into pure, actionable business assets.

Speaker 2

Because the physics of a forty-zettabyte landscape completely break traditional data models. Oh, for sure. Human cognition and, frankly, legacy server architectures just aren't built to natively comprehend or route that much throughput.

Speaker 1

No, they would just melt pretty much.

Speaker 2

You can have the most valuable data on the planet sitting in your servers, but if your processing stack can't ingest it, structure it, and analyze it at scale, it actually becomes a massive liability.

Speaker 1

Instead of a competitive advantage.

Speaker 2

Exactly.

Speaker 1

Okay, let's unpack this today. We are analyzing a foundational text to solve this exact problem, which is Practical Data Science by Andreas François Vermeulen.

Speaker 2

And this isn't just a theoretical textbook.

Speaker 1

No, not at all. It's an aggressive, really comprehensive guide to the entire enterprise technology stack.

Speaker 2

Yeah, the layered frameworks, the rigid business rules, all the stuff required to actually tame massive data sets out in the wild.

Speaker 1

Right, because what Vermeulen offers is essentially an architectural blueprint.

Speaker 2

We are moving way past the novelty of, you know, simple data science experiments

Speaker 1

On a laptop, like just running a quick Python script.

Speaker 2

Right. This text breaks down the mechanical reality of how data is stored, processed across distributed clusters, legally protected, and ultimately served up to an executive board.

Speaker 1

To drive millions of dollars in decisions.

Speaker 2

Exactly. That's the end goal.

Speaker 1

So our mission for this deep dive is to give you a cohesive mental model of that entire.

Speaker 2

Journey from start to finish.

Speaker 1

Yeah, we'll track the data flowing from a wild unstructured lake through all that complex processing machinery, all the way up to business deployment.

Speaker 2

It's quite a journey.

Speaker 1

So let's start at the source, right, taming the wild reservoir. Vermeulen defines the data lake as a massive repository storing data in its native raw format.

Speaker 2

Which is crucial to understand.

Speaker 1

Right, because for anyone who has worked with legacy systems, we know the absolute friction of the old schema-on-write

Speaker 2

Approach. Ugh, schema-on-write. It basically forces you into a rigid box before you even begin doing anything.

Speaker 1

You have to map everything out.

Speaker 2

Yeah, you have to spend months modeling the exact shape of your database tables, the data types, the relationships, all before a single byte is even loaded.

Speaker 1

And that rigidity causes massive bottlenecks.

Speaker 2

No, absolutely, because the moment a new, unexpected data format arrives from an external vendor, what happens?

Speaker 1

The whole ingestion pipeline just breaks down, shatters. That's where the modern schema-on-read philosophy comes in. You bypass that initial bottleneck completely by loading the data into the lake exactly as it

Speaker 2

Is, just raw and completely unstructured.

Speaker 1

Yeah, you only apply the organizational rules, the schema, at the exact computational moment you query the data. Yes. So is a data lake essentially a giant, unfiltered natural reservoir, and schema-on-read is like deciding whether you want to filter that water for drinking, farming, or swimming only at the exact moment you dip your bucket in?

Speaker 2

That is a perfect analogy. What's fascinating here is how that flexibility directly accelerates knowledge generation. How so? Well, by keeping the leaf-level atomic data perfectly intact, you preserve all the anomalies and the really subtle signals.

Speaker 1

Ah, because you didn't scrub them out at the start. Exactly.

Speaker 2

In exploratory data science, the actual insights are hidden in the unstructured noise.

Speaker 1

Right.

Speaker 2

If you force data through a rigid schema-on-write filter right at ingestion, you strip out those anomalies because they just don't fit your predefined assumptions.

Speaker 1

You lose what you didn't know you were looking for.

Speaker 2

Precisely. Schema-on-read preserves those unknown variables for future models.
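
The schema-on-read idea discussed above can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the book: the sample records, the field names, and the helper `read_with_schema` are all hypothetical. The raw lines are stored untouched, and each query supplies its own schema at read time.

```python
import json

# Raw events kept exactly as they arrived (hypothetical sample data).
RAW_LAKE = [
    '{"customer": "C1", "amount": "19.99", "coupon": "SPRING"}',
    '{"customer": "C2", "amount": "5.00"}',
    '{"customer": "C3", "amount": "12.50", "referrer": "billboard-7"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema only at query time; the raw lines are never modified."""
    for line in raw_lines:
        record = json.loads(line)
        # A field missing from a record becomes None instead of breaking the read.
        yield {
            field: (cast(record[field]) if field in record else None)
            for field, cast in schema.items()
        }


# Two different reads of the same raw data, each with its own schema.
billing_view = list(read_with_schema(RAW_LAKE, {"customer": str, "amount": float}))
marketing_view = list(read_with_schema(RAW_LAKE, {"customer": str, "referrer": str}))
```

The unexpected `coupon` and `referrer` fields never blocked ingestion, and they remain available to any later query that decides to ask for them.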

Speaker 1

But Vermeulen makes it clear you can't just leave everything floating in a chaotic lake forever.

Speaker 2

No, that would be a disaster.

Speaker 1

Right. Enter the Data Vault, which is a hybrid modeling methodology created by Dan Linstedt.

Speaker 2

Because we do need structure for business reporting, but we want it without losing that agility.

Speaker 1

So the Data Vault achieves this using three core architectural components, right, hubs, links, and satellites.

Speaker 2

Yeah, the mechanical genius of the Data Vault is its modularity. Hubs act as the immutable business keys.

Speaker 1

Like the absolute core identifiers, right?

Speaker 2

Like a persistent customer ID. It never changes. Okay, and the links? Links handle the transactional associations. They map how hubs interact without holding any descriptive data themselves.

Speaker 1

Got it, So where does the actual information go?

Speaker 2

All the volatile descriptive context is pushed into the satellites.

Speaker 1

So if the hub is the unchangeable concept of a specific customer and the link represents the fact that they interacted with a specific product.

Speaker 2

The satellite holds their current address, their income bracket, and the timestamp of the event.

Speaker 1

Wow, why split it up so aggressively like that?

Speaker 2

Because it isolates structural changes. Let's say your marketing department suddenly starts collecting a dozen new demographic metrics on customers. With a normal setup, you'd have to rebuild your core tables or alter existing schemas. But here you simply attach a brand new satellite to the existing hub. Oh wow. Yeah, it allows you to model incredibly complex, evolving enterprise environments while maintaining a completely auditable historical record of every single change.
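
The hub, link, and satellite split described above can be sketched with plain Python dataclasses. This is a simplified illustration under stated assumptions, not a full Data Vault implementation: the class layout, the keys `CUST-001` and `PROD-042`, the attribute values, and the helper `describe` are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Hub:
    business_key: str  # immutable identifier, e.g. a persistent customer ID


@dataclass(frozen=True)
class Link:
    hub_keys: tuple  # which hubs interacted; carries no descriptive data


@dataclass(frozen=True)
class Satellite:
    parent_key: str    # the hub this context belongs to
    name: str
    load_date: date
    attributes: tuple  # (name, value) pairs holding the volatile context


customer = Hub("CUST-001")
product = Hub("PROD-042")
purchase = Link((customer.business_key, product.business_key))

satellites = [
    Satellite("CUST-001", "address", date(2024, 1, 1), (("city", "Middelburg"),)),
]

# Marketing starts collecting new metrics: append a satellite.
# The hub, the link, and the existing satellite are left untouched.
satellites.append(
    Satellite("CUST-001", "demographics", date(2025, 6, 1), (("income_bracket", "B"),))
)


def describe(hub, sats):
    """Merge the attributes of every satellite attached to a hub, oldest first."""
    merged = {}
    for sat in sorted(sats, key=lambda s: s.load_date):
        if sat.parent_key == hub.business_key:
            merged.update(dict(sat.attributes))
    return merged


profile = describe(customer, satellites)
```

Because every satellite keeps its own load date and nothing is overwritten, the list doubles as the audit trail of changes.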

Speaker 1

That's brilliant. Okay, so now we have a highly structured, scalable reservoir. But a reservoir is useless if you don't have the industrial machinery to pump and process the water.

Speaker 2

Very true.

Speaker 1

Let's move into the processing stack Vermeulen outlines. At the absolute center of this arsenal is Apache Spark.

Speaker 2

Spark completely changed the paradigm for distributed cluster computing. Because it's so fast? Because it's resilient. When you are analyzing terabytes of telemetry data, a single machine will inevitably run out of memory.

Speaker 1

It just can't hold the weight, right.

Speaker 2

Spark solves this by utilizing resilient distributed datasets, or RDDs.

Speaker 1

Okay, what do those do?

Speaker 2

It basically shatters the massive dataset into partitions, distributes them across thousands of worker nodes in a cluster, processes the math in memory all at the same time, and then aggregates the results back together seamlessly.
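
The partition, process, and aggregate pattern described above can be imitated on a single machine with the standard library. To be clear, this sketch does not use Spark or its RDD API; a thread pool stands in for worker nodes, and the `readings` list and helper names are hypothetical stand-ins for telemetry data.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, num_partitions):
    """Split a list into contiguous chunks of roughly equal size."""
    size = max(1, -(-len(data) // num_partitions))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def partial_sum_and_count(chunk):
    """Work done independently on each partition."""
    return sum(chunk), len(chunk)


readings = list(range(1, 1001))  # stand-in for telemetry values
chunks = partition(readings, 4)

# Each chunk is processed separately; the pool plays the role of worker nodes.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial_sum_and_count, chunks))

# Aggregate the partial results back into one answer.
total = sum(s for s, _ in partials)
count = sum(c for _, c in partials)
mean = total / count
```

The mean is computed from per-partition sums and counts rather than per-partition means, which keeps the result correct even when partitions differ in size.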

Speaker 1

That is wild. And working alongside Spark is Apache Kafka. If Spark is doing the heavy computational lifting, Kafka is handling the sheer velocity of the ingestion. Exactly.

Speaker 2

Kafka operates as a distributed publish-subscribe messaging system. Like a massive router. Yeah, imagine you have a global retail operation. You've got thousands of edge devices, website clicks, supply

Speaker 1

chain updates, generating millions of events per second.

Speaker 2

Right, Kafka ingests that entire stream. It guarantees fault tolerant real time delivery to the processing.

Speaker 1

Core, so nothing gets lost. Exactly.

Speaker 2

It ensures no packets are dropped even if a downstream server briefly goes offline.
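
One way to picture why a briefly offline consumer loses nothing is an append-only log with a read offset per consumer. The sketch below is an in-memory toy, not the Kafka client API; the class `TopicLog` and the consumer names are hypothetical, and a real broker adds persistence, partitioning, and replication on top of this idea.

```python
class TopicLog:
    """Append-only event log; each consumer tracks its own read offset."""

    def __init__(self):
        self._events = []
        self._offsets = {}

    def publish(self, event):
        self._events.append(event)

    def poll(self, consumer):
        """Return the events this consumer has not seen yet, then advance its offset."""
        start = self._offsets.get(consumer, 0)
        batch = self._events[start:]
        self._offsets[consumer] = len(self._events)
        return batch


clicks = TopicLog()
clicks.publish({"page": "/home"})
first = clicks.poll("analytics")

# The analytics consumer is offline while two more events arrive.
clicks.publish({"page": "/cart"})
clicks.publish({"page": "/checkout"})

# When it comes back, it resumes from its stored offset.
caught_up = clicks.poll("analytics")

# A brand new consumer starts from the beginning of the log.
billing_batch = clicks.poll("billing")
```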

Speaker 1

Here's where it gets really interesting. If we look at the programming languages. Okay. We all know Python and R are the standard languages for data science. Sure. But if Python and R are the cognitive centers, like the brains running the logical models, are Kafka and Spark basically the central nervous system, ensuring the signals actually travel through the giant corporate body without collapsing?

Speaker 2

That analogy perfectly maps to the technical architecture.

Speaker 1

Oh, awesome.

Speaker 2

Yeah, Python is exceptional for logical wrangling, right, but native pandas DataFrames are heavily constrained by single-machine memory limits. They max out. Exactly. And similarly, R is unmatched for statistical rigor. It creates complex visualizations with libraries like ggplot2.

Speaker 1

But you can't easily scale it.

Speaker 2

Right. To apply that statistical rigor to a forty-zettabyte ocean, you need

Speaker 1

A bridge, which is where the tools come in.

Speaker 2

Yeah, that's why Vermeulen highlights packages like sparklyr. It allows data scientists to write standard R code that executes natively across a massive Spark cluster. Oh, I see. The distributed tools free the analytical brains from their single-server skulls.

Speaker 1

That's a great way to put it. And we can't ignore the edge devices feeding the system either. The text specifically highlights MQTT, MQ Telemetry Transport.

Speaker 2

A really vital protocol.

Speaker 1

Yeah, because if you have an incredibly dense array of IoT sensors, say monitoring temperature fluctuations across a massive agricultural grid, standard HTTP protocols carry way too much header overhead. They're just too bulky. Right, MQTT uses a microscopic footprint. It's the perfect protocol to shoot continuous low-bandwidth telemetry data directly into your Kafka streams.

Speaker 2

And mastering that integration, knowing how to capture lightweight MQTT signals at the edge, stream them flawlessly through Kafka, crunch the distributed math with Spark, and orchestrate it all with Python.

Speaker 1

That's the real trick.

Speaker 2

Yeah, that is the exact threshold that separates a local data analyst from an enterprise grade data scientist.

Speaker 1

Okay, but having a garage full of state of the art tools doesn't mean you actually know how to build a functional car.

Speaker 2

No, it definitely doesn't.

Speaker 1

We have the stack, but we need a blueprint which brings us to the processing frameworks required to manage these deployments without, you know, causing catastrophic failures.

Speaker 2

Because the industry graveyard is completely full of brilliant algorithms that died in production.

Speaker 1

Why do they die?

Speaker 2

Because there was no standardized engineering process. Vermeulen champions CRISP-DM, which stands for the Cross-Industry Standard Process for Data Mining. Right, it breaks the workflow into a really strict sequence: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.

Speaker 1

It seems like jumping straight into modeling without the business understanding layer is exactly why so many data pilots fail when they hit the production floor.

Speaker 2

Oh, one hundred percent. And the text emphasizes that CRISP-DM is inherently cyclical, not linear.

Speaker 1

Right, You don't just march from step one to six and clock out for the day.

Speaker 2

Far from it. The cyclical nature is a defensive mechanism against bad assumptions. Well, you might spend weeks in the modeling phase only to hit the evaluation phase and realize your predictive accuracy is hovering around.

Speaker 1

Fifty percent, basically a coin toss.

Speaker 2

Right, That failure forces you back to data preparation to engineer new features, or sometimes all the way back to business understanding because the original problem was framed incorrectly.

Speaker 1

Wow. And to operationalize this cycle at scale, Vermeulen outlines a five-layer data science framework. Yes, he grounds this using a fictional corporate sandbox called VKHCG, the Vermeulen-Krennwallner-Hillman-Clark Group. It's quite a mouthful. It is, but it's a massive conglomerate with distinct subsidiaries handling IT networks, global billboard advertising, logistics, and forex trading.

Speaker 2

It serves as the perfect stress test environment for the framework.

Speaker 1

So how do the five layers stack up to manage this complexity?

Speaker 2

At the apex is the business layer, which dictates the actual enterprise needs.

Speaker 1

Okay.

Speaker 2

Below that is the utility layer, which is a centralized vault for repeatable algorithms.

Speaker 1

Got it.

Speaker 2

Then the operational management layer handles scheduling and automated triggers.

Speaker 1

Like running the jobs, right?

Speaker 2

The audit, balance, and control layer strictly monitors data lineage and compliance. Super important. And finally, the functional layer at the bottom is where the actual algorithmic heavy lifting and data transformations execute.

Speaker 1

Looking at this architecture, it becomes painfully obvious why so many data pilots fail. Oh yeah. A data scientist will build a brilliant predictive model in a Jupyter notebook on their local machine.

Speaker 2

Which is effectively operating purely in the functional layer.

Speaker 1

Exactly, But when they try to deploy it across an enterprise like VKHCG without the operational management layer to schedule the pipelines or the audit layer to monitor data drift.

Speaker 2

The model immediately fractures under real-world conditions. It just shatters. Yeah, if we connect this to the bigger picture, the primary value of the five-layer framework isn't merely bureaucratic organization.

Speaker 1

What is it, then?

Speaker 2

It provides the architectural scaffolding required to transition a localized, fragile experiment into an automated, fault-tolerant production environment. Making it real. Exactly. A model without operational integration and continuous auditing is effectively useless to the broader enterprise.

Speaker 1

Speaking of the broader enterprise, let's look at the sheer logistical nightmare of a conglomerate like VKHCG.

Speaker 2

It's massive.

Speaker 1

Yeah, you have Krennwallner AG generating video files and high-res images from billboards. Clark Ltd is generating thousands of CSVs of forex trading data. Hillman Ltd is producing XML routing data. So much variety, right? So how do these distinct layers and subsidiaries communicate without drowning in an endless sea of custom translation

Speaker 2

APIs? That integration bottleneck is solved by the utility layer, specifically through an architectural standard Vermeulen introduces called HORUS.

Speaker 1

Which stands for the Homogeneous Ontology for Recursive Uniform Schema.

Speaker 2

That's the one.

Speaker 1

It's essentially a universal internal adapter. Let's break down the actual mathematics of why this is necessary, because the technical debt of point to point integration is just staggering.

Speaker 2

It really is. Let's hear the math.

Speaker 1

Okay, if an enterprise has one hundred different data formats and you want any system to talk to any other system, you have to write direct converters for every single combination. That's one hundred times ninety-nine, or 9,900. You're looking at nearly ten thousand custom, brittle integration scripts just to maintain baseline communication.

Speaker 2

And every time an external vendor updates an API, dozens of those point-to-point scripts break simultaneously.

Speaker 1

Which is a nightmare for the engineers.

Speaker 2

Absolute nightmare. But by instituting HORUS as the central hub, you mandate that every incoming format is translated into the HORUS standard

Speaker 1

First.

Speaker 2

Okay, if a downstream system needs that data, it translates it from HORUS into its target format.

Speaker 1

Wait, wait, I want to push back on that architecture for a second. Sure. Isn't translating format A into HORUS and then HORUS into format B just injecting a middleman into every single data pipeline? Doesn't that intermediate step add massive computational overhead and latency? Why is this actually faster in the long run?

Speaker 2

It's a really critical trade-off. Yes, you introduce a fractional computational cost by serializing and deserializing through an intermediate schema. There is a cost, but consider the alternative. By using a hub-and-spoke model, integrating one hundred formats only requires two hundred scripts: for each format, one to convert into HORUS and one to convert out.

Speaker 1

That is a huge difference.

Speaker 2

It's a ninety-eight percent savings in development time. When format one-oh-one is introduced, you don't write one hundred new integrations, you write exactly two.
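
The arithmetic quoted in this exchange is easy to check. The sketch below assumes, as the speakers do, that point-to-point converters are counted per direction (n × (n − 1)) and that the hub-and-spoke model needs one converter in and one out per format (2 × n); the function names are illustrative.

```python
def point_to_point(n):
    """Direct converters when every format talks to every other, per direction."""
    return n * (n - 1)


def hub_and_spoke(n):
    """Converters when every format goes through one shared hub format."""
    return 2 * n


n = 100
direct = point_to_point(n)       # 100 * 99
via_hub = hub_and_spoke(n)       # one in, one out, per format
savings = 1 - via_hub / direct   # fraction of scripts no longer needed

# Adding one more format to the hub costs the same two converters every time.
extra_via_hub = hub_and_spoke(n + 1) - hub_and_spoke(n)

print(f"direct={direct}, via_hub={via_hub}, savings={savings:.1%}")
```

The saving grows with the number of formats, because the direct count grows quadratically while the hub count grows linearly.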

Speaker 1

Wow, Okay, that makes perfect sense.

Speaker 2

The microscopic increase in compute latency is heavily outweighed by the elimination of thousands of hours of developer maintenance and pipeline fragility.
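
The hub-and-spoke pattern itself can be sketched as two small registries, one converting into a common form and one converting out of it. This is only an illustration of the pattern: the common form used here (a list of flat dictionaries) is a hypothetical stand-in and does not reproduce the HORUS format from the book, and the sample forex rows are invented.

```python
import csv
import io
import json


def csv_to_common(text):
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def json_to_common(text):
    return json.loads(text)


def common_to_json(rows):
    return json.dumps(rows, sort_keys=True)


def common_to_csv(rows):
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=sorted(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# One converter in and one converter out per format.
TO_COMMON = {"csv": csv_to_common, "json": json_to_common}
FROM_COMMON = {"csv": common_to_csv, "json": common_to_json}


def convert(text, source, target):
    """Every conversion passes through the common form; no pairwise converters."""
    return FROM_COMMON[target](TO_COMMON[source](text))


forex_csv = "pair,rate\nEURUSD,1.08\nGBPUSD,1.27\n"
as_json = convert(forex_csv, "csv", "json")
round_trip = convert(as_json, "json", "csv")
```

Supporting a new format, say XML, would mean adding one entry to each registry; none of the existing converters would change.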

Speaker 1

And HORUS isn't just for tabular data either. The text provides some wild examples of how the utility layer forces complex unstructured data into this homogeneous format.

Speaker 2

Yeah, the image extraction is crazy.

Speaker 1

It really is. Yeah, Vermeulen details an algorithm that takes a JPEG image of a dog named Angus, great name, and it extracts the exact red, green, blue, and alpha transparency values for every single

Speaker 2

Pixel, just tearing the image apart.

Speaker 1

Yeah, and it flattens the entire visual into a massive data frame of raw numerical arrays. And he applies the exact same logic to MP four video files, extracting frame by frame matrices.

Speaker 2

Right, because by mathematically flattening complex visual or audio data into a standardized HORUS structure, you allow standard machine learning libraries to process it, because

Speaker 1

They usually need tabular numerical inputs.

Speaker 2

Right, exactly, Now they can process a video file using the exact same underlying logic they would use to analyze a financial spreadsheet.
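
The flattening step described above can be illustrated without any imaging library. The sketch assumes the image has already been decoded into a grid of (red, green, blue, alpha) tuples; the 2×2 grid, the column names, and the helper `flatten_pixels` are hypothetical and are not taken from the book's code.

```python
# A tiny 2x2 "image": rows of (red, green, blue, alpha) tuples.
image = [
    [(255, 0, 0, 255), (0, 255, 0, 255)],
    [(0, 0, 255, 255), (255, 255, 255, 0)],
]


def flatten_pixels(pixels):
    """Turn a pixel grid into one table row per pixel."""
    rows = []
    for y, scanline in enumerate(pixels):
        for x, (red, green, blue, alpha) in enumerate(scanline):
            rows.append(
                {"x": x, "y": y, "red": red, "green": green, "blue": blue, "alpha": alpha}
            )
    return rows


table = flatten_pixels(image)
```

Each row now looks like a line in a spreadsheet, which is the shape most tabular machine learning tooling expects. A video could be handled the same way by adding a frame index column.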

Speaker 1

That is mind-blowing.

Speaker 2

It is. And because it's stored in the utility layer, any engineer across the enterprise can call that verified image extraction algorithm without having to reinvent the mathematical wheel.

Speaker 1

Which brings us to the final and unequivocally most critical piece of.

Speaker 2

The framework, the top of the pyramid.

Speaker 1

Right, we have the data lakes, the Spark clusters, the CRISP-DM blueprints, and the HORUS universal translators. But all of this flawless engineering is absolutely worthless if it solves the wrong human problems.

Speaker 2

Totally worthless.

Speaker 1

We have to ascend to the top the business layer.

Speaker 2

This is where non-technical functional requirements actually dictate the engineering parameters. Right. Vermeulen leans heavily on the MoSCoW prioritization method here.

Speaker 1

MoSCoW, that's must have, should have, could have, won't have. Exactly.

Speaker 2

It forces stakeholders to brutally separate mission critical analytical needs from purely aspirational vanity metrics.

Speaker 1

And you have to do that before a single line of code is written. Precisely. Once those strict requirements are set, the business logic has to be modeled. The text introduces sun models, developed by Mark Whitehorn, to handle this mapping.

Speaker 2

Sun models provide a phenomenal way to separate business facts from context.

Speaker 1

How do they work.

Speaker 2

The center of the model represents the fact that's a specific, undeniable event, like a financial transaction.

Speaker 1

Okay, that's the core.

Speaker 2

Right. Radiating outward are the dimensions. These are the contextual realities of that event, such as the customer's geographic location or the store's operating hours at the exact time of the transaction.

Speaker 1

And managing those dimensions over time is surprisingly complex, isn't it? Well, incredibly. The book highlights slowly changing dimensions, specifically SCD type 2, which uses an effective date column. There's a brilliant historical example used to explain why this matters: the Dutch explorer. Really? Yes, tracking Dr. Jacob Roggeveen.

Speaker 2

Right, if you look at standard relational databases, they often default to what we call SCD type one, which is simple overwriting.

Speaker 1

Meaning they just replace the old data. Yeah.

Speaker 2

So, if Dr. Roggeveen moves from his home in Middelburg to Easter Island in seventeen twenty-two, an SCD type 1 system just overwrites his address

Speaker 1

Field, which seems fine at first glance.

Speaker 2

But the problem is you've permanently destroyed your historical context.

Speaker 1

Right. But with SCD type 2, you don't overwrite. No, you add a new row and you manage it with an effective date. You log that he resided in Middelburg with an end date of April fourth, seventeen twenty-two, and a new row shows him residing on Easter Island effective April fifth, seventeen twenty-two. Exactly. Why is maintaining that temporal timeline so critical for advanced data science?

Speaker 2

Because predictive machine learning models absolutely rely on point in time accuracy. Say your algorithm is analyzing why certain customer segments canceled their subscriptions five years ago, right, it needs to evaluate the geographic and demographic dimensions of those customers as they existed five years ago, not who they are today exactly. If your database has overwritten their historical addresses with their current ones, your training data is contaminated with future knowledge.
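
The type 2 behaviour and the point-in-time lookup it enables can be sketched as follows. The two dates come from the example in the conversation; everything else is an assumption made for illustration: the row layout, the open-ended `HIGH_DATE` marker, the use of `date.min` as an unknown start date, and the helper names.

```python
from datetime import date, timedelta

HIGH_DATE = date(9999, 12, 31)  # conventional "still current" marker (assumption)


def append_version(history, person, place, effective_from):
    """Type 2 change: close the current row and append a new one; never overwrite."""
    for row in history:
        if row["person"] == person and row["effective_to"] == HIGH_DATE:
            row["effective_to"] = effective_from - timedelta(days=1)
    history.append(
        {
            "person": person,
            "place": place,
            "effective_from": effective_from,
            "effective_to": HIGH_DATE,
        }
    )


def place_as_of(history, person, as_of):
    """Point-in-time lookup: where was this person recorded on a given date?"""
    for row in history:
        if row["person"] == person and row["effective_from"] <= as_of <= row["effective_to"]:
            return row["place"]
    return None


history = []
# The start date is not given in the example, so date.min stands in for "unknown".
append_version(history, "Jacob Roggeveen", "Middelburg", date.min)
append_version(history, "Jacob Roggeveen", "Easter Island", date(1722, 4, 5))

before = place_as_of(history, "Jacob Roggeveen", date(1722, 4, 4))
after = place_as_of(history, "Jacob Roggeveen", date(1722, 4, 5))
```

A training pipeline that always calls a lookup like `place_as_of` with the date of the event being modeled avoids leaking later addresses into earlier observations.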

Speaker 1

Which completely invalidates the model's predictive power. It ruins the whole thing. And the strictness required in the data models must also be applied to the human language driving them. The text offers a brutal warning about the danger of weak words in the business layer's requirements.

Speaker 2

Oh yes, business analysts frequently write non-functional requirements stating a dashboard must be "user friendly" or a streaming pipeline must operate "seamlessly."

Speaker 1

Which sounds good in the meeting.

Speaker 2

Sure, but from an engineering perspective, those words are poisoned because.

Speaker 1

They are fundamentally untestable. You can't write a unit test for "seamless." You have to define strict binary thresholds, like: the Kafka stream will process fifty thousand events per second with latency under one hundred milliseconds.

Speaker 2

Yes, if you don't translate qualitative business desires into highly specific quantitative engineering parameters, expectations misalign, and enterprise scale projects fail right before deployment.
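
A requirement phrased as thresholds can be checked mechanically, which is the point being made here. The sketch below encodes the two figures quoted in the conversation; the class name `StreamRequirement` and the measured values passed in are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamRequirement:
    min_events_per_second: int
    max_latency_ms: float

    def is_met(self, events_per_second, latency_ms):
        """Binary verdict: both thresholds must hold."""
        return (
            events_per_second >= self.min_events_per_second
            and latency_ms < self.max_latency_ms
        )


requirement = StreamRequirement(min_events_per_second=50_000, max_latency_ms=100)

# Hypothetical measurements from two test runs.
passing = requirement.is_met(events_per_second=62_000, latency_ms=80)
failing = requirement.is_met(events_per_second=62_000, latency_ms=140)
```

There is no equivalent function for "seamless": without numbers, there is nothing for `is_met` to compare against.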

Speaker 1

So what does this all mean? A data scientist could write perfect Scala code, build a flawless Kafka stream, translate every format perfectly through HORUS. But if a business analyst writes the word "seamlessly" in the requirements or forgets to properly design SCD type 2 dimensions, the whole multimillion-dollar architecture collapses purely due to human ambiguity.

Speaker 2

That is the uncompromising reality of data science.

Speaker 1

Wow.

Speaker 2

And that reality becomes legally perilous when we factor in modern regulatory frameworks like GDPR in Europe or HIPAA in the US, which Vermeulen addresses thoroughly.

Speaker 1

Yeah, the right to be forgotten is a terrifying technical challenge. Under GDPR, a consumer can legally demand that an enterprise eradicate every trace of their personal

Speaker 2

Data, every single trace, which is huge.

Speaker 1

Yeah.

Speaker 2

This raises an important question regarding architectural accountability. Let's say you have ingested massive amounts of unstructured data into a schema-on-read data lake. Okay. But you neglected to implement the audit, balance, and control layer to track exactly how that specific user's data propagated through your HORUS translations and into your downstream machine learning model. Oh man. You simply cannot delete them because you can't find them, and

Speaker 1

The legal penalty for failing to comply can reach four percent of your global corporate.

Speaker 2

Turnover, which transforms data architecture from a back office it function into a literal existential corporate threat.
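
The link between lineage tracking and erasure can be sketched as a small registry that records every place a data subject's records were written. This is a toy model under stated assumptions: the class `LineageLog`, the store names, and the keys are hypothetical, and a real audit layer would also have to cover backups, derived features, and trained models.

```python
from collections import defaultdict


class LineageLog:
    """Records where each data subject's records have been copied."""

    def __init__(self):
        self._locations = defaultdict(set)

    def record(self, subject_id, store, key):
        self._locations[subject_id].add((store, key))

    def locate(self, subject_id):
        return sorted(self._locations.get(subject_id, set()))

    def erase(self, subject_id, stores):
        """Delete every recorded copy, then drop the lineage entries themselves."""
        removed = 0
        for store, key in self.locate(subject_id):
            if stores[store].pop(key, None) is not None:
                removed += 1
        self._locations.pop(subject_id, None)
        return removed


# Hypothetical stores: a raw lake and a downstream feature table.
stores = {
    "lake": {"raw/17": {"user": "U9"}},
    "features": {"row/3": {"user": "U9"}, "row/4": {"user": "U2"}},
}

log = LineageLog()
log.record("U9", "lake", "raw/17")
log.record("U9", "features", "row/3")
log.record("U2", "features", "row/4")

removed = log.erase("U9", stores)
```

Without the `record` calls made at write time, `erase` would have nothing to iterate over, which is the failure mode described in this exchange.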

Speaker 1

It really does. Yeah, let's recap the intense journey we've mapped out today for the listener. We bypassed the bottlenecks of schema-on-write by utilizing a wild data lake. Then we introduced the modularity of Data Vault hubs and satellites to add structure without losing historical agility.

Speaker 2

And we powered that storage with an industrial processing stack.

Speaker 1

Leveraging Kafka for fault tolerant ingestion.

Speaker 2

And Spark's distributed memory clusters to handle the immense scale, while bridging the analytical power of R and Python into that environment. Exactly.

Speaker 1

Then we contained that horsepower using the CRISP-DM blueprint and Vermeulen's five-layer enterprise framework.

Speaker 2

We routed around the nightmare of point-to-point integration by funneling everything through the HORUS universal schema, and

Speaker 1

Ultimately we tethered every piece of this complex machinery to strict, auditable, and legally compliant business layer requirements using MoSCoW prioritization and point-in-time sun models.

Speaker 2

It is an incredibly dense, tightly integrated ecosystem.

Speaker 1

It really is.

Speaker 2

But understanding how the flow of data mandates the existence of each of these specific tools and layers is exactly what separates a narrow programmer from a true systems architect.

Speaker 1

For you listening, whether you're architecting these systems yourself or simply preparing to lead a high-level strategy meeting tomorrow, understanding the mechanics of this stack gives you the vocabulary to lead the data conversation. You now understand why the structural plumbing, the audits, the hubs, the utility translations, is equally, if not more, critical than the predictive algorithms

Speaker 2

Themselves, because without that robust infrastructure, the most sophisticated predictive algorithm is just an isolated math.

Speaker 1

Equation, It doesn't actually do anything right.

Speaker 2

The stack is what bridges the gap between theoretical potential and executable automated enterprise value.

Speaker 1

I want to leave you with one final provocative thought to ponder. As distributed processing frameworks like Spark continue to integrate exponentially more powerful machine learning capabilities natively, how far are we from a tipping point?

Speaker 2

Oh that's a big question, right.

Speaker 1

What happens when an AI doesn't just process the data lake, but actively begins writing its own MoSCoW business requirements, dynamically restructuring its own sun models, and effectively managing the human business layer itself to optimize corporate outcomes?

Speaker 2

It fundamentally upends the hierarchy. When the analytical tools become capable of dictating the enterprise strategy, the frameworks we use to govern them will have to evolve dramatically.

Speaker 1

We're standing on the edge of an ever-expanding forty-zettabyte ocean. It isn't just getting deeper, it's beginning to analyze the tides. Thank you for joining us on this deep dive. Keep exploring the depths of your own data.

Transcript source: Provided by creator in RSS feed: download file