Data Science Using Oracle Data Miner and Oracle R Enterprise: Transform Your Business Systems into an Analytical Powerhouse

Speaker 1

00:00

Imagine, right, imagine a master chef, like a world class chef, starring in this massive primetime thirty minute cooking show.

Speaker 2

00:08

Okay, I'm picturing it right.

Speaker 1

00:09

So the camera rolls, the steak is perfectly seared, the sauce is just flawless, and the audience goes completely wild. But what you don't see, though, what's hidden off camera, is that this chef spent the previous eight hours standing in a back alley, literally crying while peeling potatoes and chopping, like, ten thousand onions.

Speaker 2

00:29

Yeah, that sounds about right for the industry.

Speaker 1

00:31

It's crazy. And in the technology world, that chef is a data scientist. So today we are looking at how enterprise organizations are finally, you know, firing the potato peelers.

Speaker 2

00:41

It is a critical shift. Honestly, we are looking at a fundamental rewrite of the architecture. Like, the actual plumbing of predictive analytics is being completely rerouted at the enterprise level just to eliminate these massive systemic inefficiencies.

Speaker 1

00:55

Exactly. So, if you want a shortcut to understanding how big businesses are actually transforming their standard everyday databases into these automated predictive engines, well, this is it. Today's deep dive is based on excerpts from the book Data Science Using Oracle Data Miner and Oracle R Enterprise.

Speaker 2

01:14

Which is a fantastic resource, by the way.

Speaker 1

01:15

Oh, totally. We're exploring how bringing the math directly to the data solves, honestly, one of the single biggest bottlenecks in modern tech. So okay, let's unpack this, because the way a lot of organizations currently execute data science is just fundamentally broken. It really is.

Speaker 2

01:32

What's fascinating here is that the real secret to effective data science at scale isn't necessarily about inventing a more complex neural network or a better algorithm. It's really about where those algorithms are physically executed, right?

Speaker 1

01:45

The location matters. Exactly.

Speaker 2

01:47

Moving massive amounts of data across networks just to analyze it is a critical, costly mistake, and the infrastructure we are exploring today was built specifically to stop that movement entirely.

Speaker 1

01:57

So to understand the fix, I feel like we first have to understand why data scientists are stuck chopping those onions in the first place. Because if you look at standard industry frameworks like the CRISP-DM methodology, you'd probably assume the actual modeling, like the glamorous machine learning part, is where all the time goes.

Speaker 2

02:18

Yeah, that is the assumption, but the reality is heavily, heavily skewed. In almost any enterprise deployment, data preparation takes up a staggering sixty to eighty percent of the total project effort.

Speaker 1

02:29

Sixty to eighty percent. That's insane.

Speaker 2

02:32

It is, because real-world data is inherently dirty. It's skewed, it's riddled with missing values. It's just a mess, right?

Speaker 1

02:38

So if the vast majority of the job is just cleaning up missing variables and formatting timestamps, why is the title data scientist and not data janitor?

Speaker 2

02:46

Well, because that janitoring actually dictates the success or failure of the entire model. In predictive analytics, there is basically an ironclad rule: a simple regression model built on perfectly clean data will consistently outperform a highly sophisticated deep learning model that has been fed dirty data.

Speaker 1

03:01

Wow, really? So the clean data beats the complex math every single time?

Speaker 2

03:05

Every single time, because if you don't handle the anomalies, your model simply learns the noise. It just memorizes the mistakes.

Speaker 1

03:13

But spending all that time cleaning, I mean, that's the antithesis of business agility, right? Like, if a telecom company wants to predict customer churn this month, they can't afford to spend three weeks manually cleaning billing data first.

Speaker 2

03:25

Exactly. And that naturally leads us to the data science automation pyramid from the source material. To move fast, you have to automate the data pipeline.

Speaker 1

03:34

Okay, so walk us through this pyramid. What's at the bottom?

Speaker 2

03:36

At the base, you have problem specific automation. This is automating a single, rigid workflow, like maybe a monthly sales forecast that runs exactly the same way every time.

Speaker 1

03:48

Got it. Just basic scripting, right?

Speaker 2

03:50

Then above that you have repetitive task automation, which is where you build generalized scripts to automatically handle missing values or transform columns across many different data sets.

Speaker 1

04:00

So that's taking away a lot of the manual janitor work. Exactly.

Speaker 2

04:03

It frees up the human to do actual science.
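
A generalized cleaning script of the kind described here might look like the following minimal Python sketch. The function name `fill_missing` and the sample billing rows are hypothetical, and filling gaps with the column mean or mode is just one simple policy chosen for illustration, not something the source prescribes.

```python
from statistics import mean, mode

def fill_missing(rows):
    """Fill None values column by column: mean for numeric columns, mode otherwise."""
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)  # preserve first-seen column order

    fills = {}
    for col in columns:
        present = [row[col] for row in rows if row.get(col) is not None]
        if not present:
            continue  # nothing to learn a fill value from
        numeric = all(isinstance(v, (int, float)) for v in present)
        fills[col] = mean(present) if numeric else mode(present)

    return [
        {col: row[col] if row.get(col) is not None else fills.get(col) for col in columns}
        for row in rows
    ]

# Hypothetical billing data with gaps in both a text and a numeric column.
billing = [
    {"plan": "basic", "monthly_fee": 30},
    {"plan": None, "monthly_fee": 50},
    {"plan": "basic", "monthly_fee": None},
]
cleaned = fill_missing(billing)
```

Because the function inspects whatever columns it is given, the same script can be pointed at a different table without modification, which is the point of this tier of the pyramid.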

Speaker 1

04:05

Okay, so what's at the very top of the pyramid?

Speaker 2

04:07

Then, the automated statistician. Ooh, that sounds intense. It is. This is an environment where the system evaluates the underlying data structures, learns the patterns, and automatically selects the optimal algorithm without requiring a human to manually tune the hyperparameters.

Speaker 1

04:24

Wait, getting to that top tier sounds incredible, but I mean the glaring issue here is the friction of traditional architecture.

Speaker 2

04:30

Right.

Speaker 1

04:31

Historically, when a data scientist wanted to run those repetitive data cleaning scripts, they were using client-side tools like Python or open source R or SAS on their laptops.

Speaker 2

04:42

Yeah, which means they had to extract the data. Traditional analytical environments basically sit on a separate application server or on the data scientist's local machine. So to run your model, you have to query your central enterprise database, extract gigabytes or sometimes terabytes of data, push it over the network, load it into the memory of your analytical tool, process it, and then attempt to write the results back.

Speaker 1

05:04

It sounds exhausting just describing it. So, bringing it back to our chef analogy, traditional data science is like storing all your raw ingredients in this massive warehouse all the way across town. Yes, exactly. And every single time you want to test a new recipe, you literally have to drive a semi truck across the city, load up the ingredients, drive back to your kitchen, cook the meal, and then drive the leftovers back to the warehouse.

Speaker 2

05:29

It's catastrophic for efficiency. You hit network I/O bottlenecks immediately, you hit integration failures, and most importantly, client-based tools simply choke because they require the data set to be loaded into active RAM.

Speaker 1

05:42

Right. You can't just cram everything in there.

Speaker 2

05:44

No, you absolutely cannot load a two terabyte customer table into the RAM of a standard application server. It will crash.

Speaker 1

05:51

Here's where it gets really interesting, though, because the solution presented in this architecture, specifically Oracle Advanced Analytics, basically just builds the kitchen inside the warehouse.

Speaker 2

06:02

Precisely. Oracle Advanced Analytics, or OAA, operates directly on top of the Oracle database kernel. The data never actually moves.

Speaker 1

06:09

So you're just cooking where the food is.

Speaker 2

06:11

Exactly. By eliminating data extraction, you effectively achieve zero latency in your data pipeline.

Speaker 1

06:18

And I imagine you bypass the memory limits of a local machine entirely, right? Because the database is already optimized to query and process data directly from storage.

Speaker 2

06:28

Oh, absolutely. Plus, you don't have to worry about security protocols breaking down over some open network connection, right?

Speaker 1

06:34

Because once the data leaves the database, you've kind of lost control over who sees it.

Speaker 2

06:39

Security is a massive factor here. The data remains governed by the strict native security policies of the database itself. But from a purely performance standpoint, executing inside the kernel allows the algorithms to leverage Oracle's parallel processing capabilities.

Speaker 1

06:55

Meaning instead of one computer churning through the data row by row by row, the database can split the job across dozens of internal processors simultaneously.

Speaker 2

07:04

Exactly, and when predictive models run directly inside the kernel, the whole business posture shifts. You aren't extracting data to run some post mortem analysis of what happened last quarter.

Speaker 1

07:14

Yeah, it's not looking backward anymore.

Speaker 2

07:16

Right. You are querying the database in real time to ask, what is the probability this specific transaction happening right now is fraudulent?

Speaker 1

07:26

Okay, so the kitchen is inside the warehouse, which is great, but we still have to do the sixty to eighty percent of the workload that involves data preparation, right? I mean, bringing the math to the data doesn't magically clean it. We still need the tools to handle anomalies.

Speaker 2

07:41

Yes, we do, and that is handled through an in-database PL/SQL package called DBMS_DATA_MINING_TRANSFORM.

Speaker 1

07:47

Okay, quite a mouthful. It is, yeah.

Speaker 2

07:50

But basically this is the toolkit for managing that massive data preparation phase without ever leaving the database.

Speaker 1

07:56

So let's talk about how this toolkit actually works, specifically with outliers. Because if you have an e-commerce platform and you're analyzing, say, average order value, one user buying a fifty-thousand-dollar watch is going to completely skew your standard distribution.

Speaker 2

08:10

Oh, absolutely.

Speaker 1

08:11

So the toolkit handles this using what they call winsorizing or trimming.

Speaker 2

08:14

Right, unhandled extreme values will drag the mean of your data so far from the median that any distance based algorithm you use will generate completely erroneous clusters. It'll just ruin the model.

Speaker 1

08:28

So if you're using this PL/SQL package, how do these two methods, winsorizing and trimming, mechanically solve that watch problem?

Speaker 2

08:35

Well, trimming is the brute force approach. It literally clips the extreme tail ends of your distribution, say the top one percent of values, and just sets them to NULL.

Speaker 1

08:46

Just delete them, basically.

Speaker 2

08:48

Effectively, yes. Winsorizing, on the other hand, is much more elegant. Instead of removing the data point entirely, it caps it. It replaces those extreme tail values with a specified maximum parameter, pulling the outlier back into the acceptable edge of the distribution.
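
The difference between the two treatments can be seen in a minimal Python sketch. This is not the DBMS_DATA_MINING_TRANSFORM implementation; the function names, the nearest-rank percentile rule, the 10th/90th percentile cutoffs, and the sample order values are all assumptions made for illustration.

```python
from statistics import mean, median

def percentile(sorted_vals, q):
    """Nearest-rank style percentile of an already-sorted list, q in [0, 100]."""
    idx = round(q / 100 * (len(sorted_vals) - 1))
    return sorted_vals[max(0, min(len(sorted_vals) - 1, idx))]

def trim(values, lower_q, upper_q):
    """Trimming: values outside the percentile band become None (NULL)."""
    s = sorted(values)
    lo, hi = percentile(s, lower_q), percentile(s, upper_q)
    return [v if lo <= v <= hi else None for v in values]

def winsorize(values, lower_q, upper_q):
    """Winsorizing: values outside the band are capped at the band's edges."""
    s = sorted(values)
    lo, hi = percentile(s, lower_q), percentile(s, upper_q)
    return [min(max(v, lo), hi) for v in values]

# Hypothetical order values with one fifty-thousand-dollar watch.
orders = [40, 55, 60, 62, 65, 70, 75, 80, 90, 50_000]

raw_mean, raw_median = mean(orders), median(orders)  # the mean is dragged far above the median
trimmed = trim(orders, 10, 90)        # the watch becomes None
capped = winsorize(orders, 10, 90)    # the watch is pulled back to 90
capped_mean = mean(capped)
```

With the outlier capped, the mean of the hypothetical orders sits close to the median again, which is what a distance-based algorithm needs.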

Speaker 1

09:03

Oh, I see. So winsorizing is like taking a person who's screaming through a megaphone in a crowded room and forcing them to just whisper, while trimming is just throwing them out of the building entirely so they don't ruin the party.

Speaker 2

09:14

That's a great way to visualize it. Yes, but dealing with outliers is just the first step. You also have to normalize the data before you apply the algorithm.

Speaker 1

09:21

Right, normalization, which is bringing variables to a uniform scale using min-max or z-score calculations. Because, and correct me if I'm wrong, if you feed an algorithm a customer's age, which is a two-digit number, alongside their annual income, which is a six-digit number, the geometry of the algorithm will mathematically assume the income is exponentially more important, just simply because the integer is larger.

Speaker 2

09:46

You nailed it. The algorithm operates on mathematical distance. If you don't scale the inputs to a uniform magnitude, your model is practically useless.
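
A small Python sketch makes the age-versus-income point concrete. The customer values are hypothetical, and the helper names `min_max` and `z_score` are illustrative rather than anything taken from the Oracle package.

```python
from math import dist
from statistics import mean, pstdev

def min_max(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Center values on the mean and express them in standard deviations."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical customers: the first two differ by 20 years but only $500.
ages = [25, 45, 65]
incomes = [100_000, 100_500, 150_000]

raw = list(zip(ages, incomes))
raw_distance = dist(raw[0], raw[1])  # Euclidean distance on unscaled features
income_share = (incomes[1] - incomes[0]) ** 2 / raw_distance ** 2  # fraction due to income

scaled = list(zip(min_max(ages), min_max(incomes)))
scaled_distance = dist(scaled[0], scaled[1])  # now driven almost entirely by the age gap
```

On the raw features, nearly all of the squared distance between the first two customers comes from the $500 income difference. After min-max scaling, the 20-year age gap dominates instead.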

Speaker 1

09:55

It just gets confused by the big numbers.

Speaker 2

09:58

Exactly. But the toolkit goes beyond scaling. It also performs complex binning, which is transforming continuous data into discrete categories.

Speaker 1

10:05

Yeah. The supervised binning feature is what really caught my eye in the source text, because instead of a human arbitrarily deciding that high income starts at exactly one hundred thousand dollars, supervised binning automates the logic.

Speaker 2

10:18

It does. It uses a decision tree algorithm under the hood, so the system analyzes the data's relationship to your target outcome, like whether a customer churned or not. If the decision tree determines that a massive spike in churn happens specifically when income drops below, let's say, sixty-four thousand three hundred dollars, it sets the bin boundary exactly there.

Speaker 1

10:39

Oh wow.

Speaker 2

10:39

Yeah, it lets the predictive power of the data dictate the categorization completely, removing human bias.
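
The idea of letting the target choose the boundary can be sketched with a single-split decision stump in Python. This is a simplification made for illustration: the Gini criterion, the function names, and the toy churn data are assumptions, and a real supervised binning routine would typically produce several bins rather than one cut point.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(values, labels):
    """Return the threshold that minimizes the weighted Gini impurity of the two sides."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_threshold, best_score = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary can be placed between identical values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint of the gap
            best_score = score
    return best_threshold

# Hypothetical customers: 1 = churned, 0 = stayed.
incomes = [30_000, 42_000, 55_000, 63_000, 66_000, 80_000, 95_000, 120_000]
churned = [1, 1, 1, 1, 0, 0, 0, 0]

boundary = best_split(incomes, churned)
bins = ["low" if income < boundary else "high" for income in incomes]
```

Here the boundary lands in the gap between 63,000 and 66,000 because that is where churn behavior changes in the toy data, not because anyone picked a round number.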

Speaker 1

10:46

Well, wait, if I'm just a standard database administrator or a business analyst running this, how do I know if the algorithm I want to use requires min-max scaling or a z-score or supervised binning? Like, I wouldn't know that.

Speaker 2

10:57

And you frequently don't need to know. Oracle utilizes a feature called Automatic Data Preparation, or ADP. Nice. When enabled, ADP intercepts your request, evaluates the specific algorithm you've chosen, say a support vector machine, which strictly requires normalized inputs, and it automatically executes the correct mathematical transformations inside the kernel before running the model.
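
Conceptually, this behaves like a lookup from algorithm to required preparation. The Python sketch below is only an illustration of that dispatch idea, not Oracle's implementation: the rule table, the function names, and the decision tree entry are assumptions, with only the support vector machine's need for normalized inputs coming from the discussion above.

```python
def min_max(column):
    """Rescale a numeric column to the range 0..1."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

def no_prep(column):
    """Leave the column untouched."""
    return list(column)

# Hypothetical rule table mapping an algorithm name to its preparation step.
PREP_RULES = {
    "svm": min_max,            # stated above: SVM requires normalized inputs
    "decision_tree": no_prep,  # assumption for illustration only
}

def prepare(algorithm, columns, auto_prep=True):
    """Apply the algorithm's preparation rule to every column when auto_prep is on."""
    transform = PREP_RULES[algorithm] if auto_prep else no_prep
    return {name: transform(values) for name, values in columns.items()}

features = {"age": [20, 40, 60], "income": [30_000, 60_000, 90_000]}
svm_ready = prepare("svm", features)
tree_ready = prepare("decision_tree", features)
```

The caller only names the algorithm; the scaling decision is made for them, which is the convenience, and also the black-box risk, raised later in the conversation.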

Speaker 1

11:20

That is so cool. It handles the prerequisites dynamically. So the data is prepped, the environment is secure, and we are finally ready for the top tier of that pyramid we talked about: the automated statistician.

Speaker 2

11:30

Yes, this is where we look at the DBMS_PREDICTIVE_ANALYTICS package. It contains three highly automated APIs: PREDICT, EXPLAIN, and PROFILE.

Speaker 1

11:40

Right. PREDICT automatically generates an outcome variable, EXPLAIN ranks the importance of the independent variables, and PROFILE extracts the core business rules the model found. You literally just pass it the table name and the target column and it does the rest.

Speaker 2

11:54

It really is that straightforward.

Speaker 1

11:55

So what does this all mean? It sounds like we are completely democratizing machine learning, allowing average SQL users to perform data science without knowing the math.

Speaker 2

12:04

It absolutely democratizes access. But this raises an important question: is it safe to lower the barrier to entry that far?

Speaker 1

12:11

That's a very fair point.

Speaker 2

12:13

When an analyst just presses a predict button, they are essentially trusting a black box. The system is making vast mathematical assumptions on their behalf.

Speaker 1

12:22

I agree, and frankly, I'm a bit skeptical. If you let an average user bypass the math and just blindly apply predictive models to their company's revenue data, aren't we just accelerating how fast they can make a catastrophic business decision?

Speaker 2

12:34

That is the inherent risk of democratization, without a doubt. If the user doesn't understand the underlying assumptions of the models, the results can be dangerous. Most powerful parametric machine learning models assume that your underlying data follows a normal, bell-curve distribution, right? If you feed them highly skewed, non-normal data, the predictions will be mathematically invalid, period.

Speaker 1

12:59

Which is why having statistical tests built directly into the database is so critical, I guess. You don't have to blindly trust the black box. You can use native SQL functions like the Shapiro-Wilk test to evaluate the normality of your distribution right there in the query.

Speaker 2

13:13

Exactly. Shapiro-Wilk evaluates the null hypothesis that your sample came from a normally distributed population.

Speaker 1

13:20

Okay, so if I run that SQL query and it returns a p-value...

Speaker 2

13:23

Of zero, you instantly know your data is non-normal. You can test your assumptions without having to extract the data to a specialized statistical software package. It's all right there.

Speaker 1

13:33

And the analytical capabilities of modern SQL don't stop there. The source material dives into functions like LAG, LEAD, and these really complex windowing functions. And these aren't just convenient syntax, they are massive performance lifesavers.

Speaker 2

13:47

They really are. LAG and LEAD allow you to access data from previous or subsequent rows in the exact same result set without having to use a clunky self-join.

Speaker 1

13:57

So, like, if a retailer is calculating year-over-year sales growth across ten thousand stores, they don't have to pull millions of rows into a Python data frame just to calculate a rolling average. They can use a SQL windowing function to calculate that moving average directly on the storage disk.

Speaker 2

14:14

And by processing that rolling average inside the database using SQL, you are leveraging the internal optimizer. It completes the calculation in a fraction of the time and only returns the final aggregated insight to the application layer.
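
The windowing pattern described here can be tried out with a short Python sketch that uses the standard library's SQLite driver as a stand-in for an Oracle database. The table, column names, and figures are hypothetical, and the sketch assumes the bundled SQLite is version 3.25 or later, which is when window functions were added.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store_id INTEGER, yr INTEGER, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        (1, 2021, 100.0), (1, 2022, 120.0), (1, 2023, 150.0),
        (2, 2021, 200.0), (2, 2022, 180.0), (2, 2023, 210.0),
    ],
)

# LAG reads the previous year's row for the same store without a self-join;
# AVG over a two-row frame gives a moving average. All of it runs in the engine.
query = """
SELECT store_id,
       yr,
       revenue,
       LAG(revenue) OVER (PARTITION BY store_id ORDER BY yr) AS prev_revenue,
       (revenue - LAG(revenue) OVER (PARTITION BY store_id ORDER BY yr))
           / LAG(revenue) OVER (PARTITION BY store_id ORDER BY yr) AS yoy_growth,
       AVG(revenue) OVER (PARTITION BY store_id ORDER BY yr
                          ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS moving_avg
FROM sales
ORDER BY store_id, yr
"""
rows = conn.execute(query).fetchall()
```

Only the finished result rows cross into the application. The first year for each store has no previous row, so its `prev_revenue` and `yoy_growth` come back as NULL (`None` in Python).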

Speaker 1

14:27

Okay, so SQL is incredibly powerful. But let's play Devil's advocate for a second.

Speaker 2

14:31

Let's do it.

Speaker 1

14:32

What if your company's lead data scientist is, like, a hardcore statistical researcher? You know the type. They spent eight years getting a PhD. They live and breathe the open source R programming language, and they rely on these massive, crowdsourced libraries of cutting-edge algorithms that standard SQL just doesn't natively support.

Speaker 2

14:52

Are they just forced to abandon R and write Oracle PL/SQL?

Speaker 1

14:57

Not at all. That exact friction is what Oracle R Enterprise, or ORE, was engineered to eliminate.

Speaker 2

15:02

Because open source R has the same architectural flaw we talked about earlier, right? It's entirely client based. It has to load everything into the local laptop's RAM.

Speaker 1

15:09

Exactly. Open source R is brilliant for innovation, but it is fundamentally incapable of handling true enterprise big data. If you try to run an advanced clustering algorithm on a billion rows of transaction data using open source R, the memory limit will immediately crash the session.

Speaker 2

15:24

Just a blue screen of death, basically.

Speaker 1

15:26

Pretty much.

Speaker 2

15:27

So how does ORE solve this without forcing that PhD data scientist to learn a whole new language? The source outlines a three-layer architecture to make R database compatible. Right, so layer one is simply the client R engine. The data scientist sits at their laptop and writes standard R code in their normal IDE. They don't change their workflow at all.

Speaker 1

15:45

Okay, and layer two is where the magic happens.

Speaker 2

15:47

Right.

Speaker 1

15:48

The database has this transparency layer.

Speaker 2

15:50

Yes, this is the crucial translation mechanism. When the data scientist writes an R command to filter a data set or apply a transformation, the transparency layer intercepts that command. Oh, okay. It does not pull the data to the laptop. Instead, it dynamically translates the R syntax into a highly optimized SQL query. It maps the R data frames directly to Oracle tables or views.
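
The translation idea can be illustrated with a toy Python class that records data frame style operations and compiles them to SQL only when results are requested. This is a hedged sketch of the general pattern, not how Oracle R Enterprise is implemented: the `LazyFrame` class, its methods, and the `customers` table are hypothetical, and SQLite stands in for the database.

```python
import sqlite3

class LazyFrame:
    """Records operations and compiles them to SQL instead of loading rows locally."""

    def __init__(self, table, columns=("*",), predicates=()):
        self.table = table
        self.columns = tuple(columns)
        self.predicates = tuple(predicates)

    def select(self, *columns):
        return LazyFrame(self.table, columns, self.predicates)

    def filter(self, condition, value):
        # condition is a SQL fragment with one "?" placeholder, e.g. "income > ?"
        return LazyFrame(self.table, self.columns, self.predicates + ((condition, value),))

    def to_sql(self):
        sql = f"SELECT {', '.join(self.columns)} FROM {self.table}"
        if self.predicates:
            sql += " WHERE " + " AND ".join(cond for cond, _ in self.predicates)
        return sql, [value for _, value in self.predicates]

    def collect(self, conn):
        # Only here does any data move, and only the rows that match.
        sql, params = self.to_sql()
        return conn.execute(sql, params).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, income REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, 40_000.0), (2, 75_000.0), (3, 120_000.0)],
)

high_income = LazyFrame("customers").filter("income > ?", 50_000).select("id", "income")
sql, params = high_income.to_sql()
rows = high_income.collect(conn)
```

The user chains methods as if working on a local frame, while the filtering itself is executed by the database engine.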

Speaker 1

16:11

That is wild. So the data scientist thinks they are manipulating a local R data frame, but behind the scenes, Oracle is essentially spoofing the environment and executing a native SQL query on the server.

Speaker 2

16:23

Exactly. It's totally seamless. And then layer three consists of spawned R engines running directly on the database server itself. If the data scientist uses an ORE package like OREdm, which maps directly to Oracle Data Mining algorithms, the execution happens entirely inside the kernel using parallel processing.

Speaker 1

16:39

But wait, what if they are using a custom third-party R package that Oracle doesn't natively map to? How do you keep the memory from crashing?

Speaker 2

16:47

Then that's where functions like ore.rowApply come in. It allows the database server to partition the massive data set into manageable chunks, spawn multiple R engines directly on the server in parallel, feed the data chunks to those engines, and then reassemble the results at the end.
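
The partition, process in parallel, and reassemble pattern can be sketched in Python. This only mirrors the shape of the idea: the `row_apply` and `score_chunk` names are hypothetical, a thread pool stands in for the server-side R engines, and none of this reflects the actual ore.rowApply signature.

```python
from concurrent.futures import ThreadPoolExecutor

def chunked(rows, size):
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

def row_apply(rows, func, chunk_size, workers=4):
    """Apply func to each chunk in parallel, then stitch the results back in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in the order the chunks were submitted.
        chunk_results = pool.map(func, chunked(rows, chunk_size))
        return [item for result in chunk_results for item in result]

def score_chunk(chunk):
    """Hypothetical per-chunk computation; each worker sees only its own slice."""
    return [amount * 2 for amount in chunk]

transactions = list(range(10))
scored = row_apply(transactions, score_chunk, chunk_size=3)
```

No worker ever holds more than one chunk, which is why the approach sidesteps the single-machine memory ceiling discussed earlier.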

Speaker 1

17:03

Oh that's incredibly smart.

Speaker 2

17:05

Yeah, you get the full analytical power of custom R packages without ever moving the data across the network or overwhelming a single machine's RAM.

Speaker 1

17:13

If we connect this to the bigger picture, this integration is the ultimate bridge. You are taking the rapid innovation and the massive crowdsourced brilliance of the open source R community, and you're seamlessly plugging it directly into the heavy-duty, industrial-scale processing power of an enterprise database.

Speaker 2

17:31

You get the best of both worlds. The agility of open source statistical libraries combined with the scalability, parallel execution, and strict security of a Tier one database. You just don't have to compromise anymore.

Speaker 1

17:42

This has been a really fascinating exploration. We've completely broken down why the traditional paradigm of data science, you know, extracting the data from the source and moving it to

17:53

the math, is just a fragile, bottlenecked system. And we've seen how inverting that paradigm, bringing the math directly to the data through Oracle Data Miner, advanced SQL analytics, and that awesome transparency layer of Oracle R Enterprise, allows organizations to execute real-time, highly scalable predictions without moving a single byte of data.

Speaker 2

18:14

It permanently alters the velocity with which an enterprise can generate actionable foresight.

Speaker 1

18:18

It changes everything, and for you listening, understanding this architectural shift puts you at a massive advantage. Whether you are architecting a back end, leading a business unit, or just tracking the evolution of AI, knowing that data movement and data preparation are the true hidden constraints of machine learning really changes how you should evaluate every new tech solution on the market.

Speaker 2

18:38

It absolutely should dictate your strategy, and I want to leave you with a final thought to mull over. We spent a lot of time discussing the top tier of that automation pyramid, the automated statistician. With tools actively translating code, automatically normalizing variables, and letting decision trees handle data prep, the mechanical friction of data science is vanishing.

Speaker 1

19:01

Yeah, it's getting so automated.

Speaker 2

19:03

So if this trend accelerates over the next decade, what happens to the human data scientist? Will the prestigious role of data scientist eventually pivot away from writing code entirely, transforming them into business strategists who simply understand how to ask a database the right strategic question?

Speaker 1

19:20

It's an incredible thought. From spending eight hours peeling potatoes to finally just sitting at the chef's table and designing the menu. Thanks for taking the deep dive with us.

