Easily: Practical Machine Learning Algorithms with Python

Speaker 1

00:00

Okay, so have you ever felt like you're just drowning in information, you know, for a project, or maybe getting ready for a meeting, or even just trying to learn something new, and you just wish someone could kind of boil it all down?

Speaker 2

00:11

Yeah, just give you the essentials, right? What really matters.

Speaker 1

00:13

Exactly, and maybe, you know, throw in a few surprising bits to keep it interesting. Well, if that sounds like you, you are definitely in the right place, because that's what we do here on the Deep Dive. We're sort of your shortcut to getting properly informed, and today we're taking a deep dive into your source material. These are excerpts from Easily: Practical Machine Learning Algorithms with Python by Dr. Darren Thomas.

Speaker 2

00:38

Yeah, and this isn't like your standard dense textbook. The author, Dr. Thomas, he's got a PhD, loads of teaching experience.

Speaker 1

00:46

And get this, a background in saxophone performance.

Speaker 2

00:50

Right, but his passion for machine learning led him to use these algorithms in education. He's a lecturer now at Asia Pacific International University.

Speaker 1

00:58

And the book's aim, which is really, really key for us today, is to be a simple, easy-to-follow, kind of condensed guide for actually using these algorithms with Python.

Speaker 2

01:09

Exactly. He even says his goal was always to show what to do, rather than talk a lot about how to do it. So less heavy theory, more hands-on application.

Speaker 1

01:17

Okay, so, an important note, then: the book, and this deep dive too, sort of assume you've already got some background in Python, maybe data science and stats.

Speaker 2

01:27

Yeah, it's more for folks looking to build on existing skills, maybe not for absolute beginners to data science itself.

Speaker 1

01:33

Right. So our mission today: we want to unpack some of the most common machine learning algorithms. We'll look at classification, that's predicting categories.

Speaker 2

01:41

Like spam or not spam.

Speaker 1

01:43

Exactly, and numeric prediction, predicting continuous values like, say, house prices. We'll get into how they're used, their surprising upsides, their challenges, and crucially, how you actually figure out if the model's any good and how to make it better.

Speaker 2

01:58

Yeah, judging and improving them. Key.

Speaker 1

02:00

Totally. So, ready to jump in? Let's unpack this.

Speaker 2

02:02

Let's do it.

Speaker 1

02:04

Okay. First up, decision trees. You can think of these as like the foundation for a lot of predictive stuff.

Speaker 2

02:11

Right. At its heart, a decision tree is just a way to classify things or predict numbers by splitting your data up. Splitting it how? It basically keeps dividing the sample into smaller and smaller groups, trying to make each little group as similar as possible inside.

Speaker 1

02:26

So it looks like a tree visually, like a flowchart?

Speaker 2

02:28

Exactly like a flowchart. You start at the top, the root node. That's your first big decision point. Then you follow branches down through more decision nodes, making more splits.

Speaker 1

02:36

Until you hit the end, the leaf nodes.

Speaker 2

02:39

Yep, the leaf nodes. That's where you get your final prediction.

Speaker 1

02:42

And how does it decide where to split?

Speaker 2

02:44

Well, for classification, it often uses something called entropy. High entropy means things are really mixed up. The tree tries to make splits that reduce that entropy, creating purer groups.
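
To make the entropy idea concrete, here is a minimal sketch, assuming Shannon entropy measured in bits. The function names and the alive/dead labels are invented for illustration and are not taken from the book.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())

def information_gain(parent, left, right):
    """Entropy reduction from splitting `parent` into `left` and `right`."""
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
    return entropy(parent) - weighted

# Hypothetical labels, invented for illustration only.
mixed = ["alive", "dead", "alive", "dead"]   # evenly mixed group
pure = ["alive", "alive", "alive", "alive"]  # perfectly pure group

print(entropy(mixed))
print(entropy(pure))
print(information_gain(mixed, ["alive", "alive"], ["dead", "dead"]))
```

A split that sends every "alive" row one way and every "dead" row the other removes all of the parent's entropy, which is the kind of split the tree is searching for.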

Speaker 1

02:55

Lower entropy, more homogeneous, got it. And for predicting numbers?

Speaker 2

03:00

Numeric prediction uses metrics like mean squared error, MSE, or maybe R-squared, to guide the splits.

Speaker 1

03:07

Okay, so what's great about them? Why start here?

Speaker 2

03:10

Well, a big plus is flexibility. They handle missing data pretty well. They don't really care if your data isn't, you know, perfectly normal, a nice bell curve. And you don't even have to use all your variables. Plus, and this is a big one, they're relatively easy to interpret.

Speaker 1

03:25

Ah, so you don't need a math PhD to figure out what it's doing.

Speaker 2

03:29

Pretty much. You can literally trace a path down the tree and see the reasoning.

Speaker 1

03:34

That transparency sounds really useful, especially if you need to explain why a prediction was made.

Speaker 2

03:40

Absolutely. Imagine telling someone why their loan was denied. Showing them a simple tree is way easier than explaining, you know, complex equations from some other models. It builds trust.

Speaker 1

03:52

Okay, but there's always a catch, right? What's the downside?

Speaker 2

03:55

The main one is that they can get really complex if you let them grow too much, really deep trees, and that often leads to overfitting.

Speaker 1

04:03

Overfitting, like it learns the training data too well?

Speaker 2

04:06

Exactly, it fits the specific sample data perfectly, maybe even the noise. But then it can't generalize well to new data it hasn't seen before.

Speaker 1

04:14

Like memorizing answers instead of understanding the concept.

Speaker 2

04:18

Perfect analogy. And of course, a super complex tree, even though it's visual, can still be hard to explain easily.

Speaker 1

04:24

Okay, let's make it concrete. The source had an example, right? A cancer data set?

Speaker 2

04:27

Yeah, predicting health status, alive or dead. The model mainly used variables like time and age for its splits.

Speaker 1

04:35

Then how did it perform?

Speaker 2

04:36

It got seventy-eight percent accuracy on the data it trained on, but then on the unseen test data it dropped slightly to seventy-three percent.

Speaker 1

04:44

That drop, five percent, is that bad or expected?

Speaker 2

04:47

That's actually pretty common and often expected. It shows it's generalizing okay, still performing decently on new stuff. It learned patterns and applied them.

Speaker 1

04:54

Right. Now, what about using it for numeric prediction? Same data set, but predicting age?

Speaker 2

05:01

Yep. So here, instead of looking at purity like with Gini, the tree uses MSE, mean squared error, in the nodes, and the leaves predict the average age for that group.

Speaker 1

05:10

And which variables were important there?

Speaker 2

05:12

The source mentioned ph.karno and meal.cal.

Speaker 1

05:14

And the results?

Speaker 2

05:16

Well, the correlation between the actual age and predicted age was okay on the training data, about 0.54. Moderate, yeah. But on the test data it dropped way down to 0.18.

Speaker 1

05:28

Ouch, big drop. What about the error, the MSE?

Speaker 2

05:31

MSE was 61.8 on the training set, but jumped up to 85.24 on the test set.

Speaker 1

05:37

So again, that drop in performance tells us?

Speaker 2

05:39

It tells us that while it learned something, it really struggled to generalize the age prediction to new people. It highlights that overfitting risk with single trees.

Speaker 1

05:48

Which leads us perfectly into the next one: random forest. This sounds cool.

Speaker 2

05:51

Yeah, this is where it gets really interesting. Random forest tackles that overfitting problem head on. How? Instead of building just one decision tree, it builds hundreds, maybe even thousands of them.

Speaker 1

06:00

Wow, okay. Where does the random part come in?

Speaker 2

06:02

Two places. First, each tree is built using a different random sample of your data drawn with replacement, called bootstrapping. Second, at each split point in a tree, it only considers a random subset of your available features.

Speaker 1

06:15

So not every tree sees all the data, and not every split considers all the factors.

Speaker 2

06:20

Exactly. And the idea is you get lots of slightly different trees, none of them perfect, but hopefully their errors are kind of random and cancel each other out.

Speaker 1

06:28

So how does it make a final prediction with all those trees?

Speaker 2

06:31

It's pretty democratic, actually. For classification, it's just a majority vote: whichever prediction most trees make wins.

Speaker 1

06:38

Simple enough. And for predicting numbers?

Speaker 2

06:41

It just averages the predictions from all the individual trees.
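
The two random forest ingredients described here, sampling rows with replacement and then combining many trees' outputs, can be sketched in plain Python. This is only an illustration: no trees are actually trained, and the tree outputs below are hypothetical values invented for the example.

```python
import random
from collections import Counter
from statistics import mean

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows at random WITH replacement (bootstrapping)."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Classification: the label predicted by the most trees wins."""
    return Counter(predictions).most_common(1)[0][0]

def average_prediction(predictions):
    """Numeric prediction: the mean of the individual tree outputs."""
    return mean(predictions)

rng = random.Random(0)   # fixed seed so the sketch is repeatable
rows = list(range(10))   # stand-in for ten rows of training data
sample = bootstrap_sample(rows, rng)

# Hypothetical outputs from five already-trained trees.
tree_labels = ["female", "male", "female", "female", "male"]
tree_values = [0.40, 0.55, 0.45, 0.50, 0.60]

print(sample)                           # some rows repeat, some are left out
print(majority_vote(tree_labels))       # female
print(average_prediction(tree_values))
```

Because each tree sees a different bootstrap sample, the individual votes differ slightly, and the aggregation step is what smooths those differences out.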

Speaker 1

06:44

And this whole wisdom-of-the-crowd thing really helps with overfitting?

Speaker 2

06:49

Massively. That aggregation step makes the model much more robust and way less prone to overfitting compared to a single complex decision tree.

Speaker 1

06:59

So the benefits seem clear: less overfitting, works well even if you don't have tons of data, handles missing values. Sounds great.

Speaker 2

07:06

It often is. It's a very popular, reliable algorithm for those reasons.

Speaker 1

07:10

Okay, but the drawback? You mentioned transparency with decision trees. What about here?

Speaker 2

07:15

Oh yeah, that's the main trade-off. With potentially thousands of trees, you can't just visualize it like a single flowchart anymore.

Speaker 1

07:22

So it becomes a black box?

Speaker 2

07:25

Pretty much. You know what goes in, you know what comes out, but explaining exactly how it arrived at that specific prediction is really hard.

Speaker 1

07:32

That's a problem when?

Speaker 2

07:34

When you need to explain the why, like we said: loan applications, medical diagnoses. If you can't explain the reasoning, it can cause issues with trust or even regulations that demand transparency.

Speaker 1

07:46

Right, so that lack of interpretability is a real consideration. You might get great predictions but lose the explanation.

Speaker 2

07:52

It's a definite trade off you have to weigh.

Speaker 1

07:54

Let's look at the example, the DoctorAUS data set, predicting gender.

Speaker 2

07:58

Right, the source noted income and age came out as strong predictors, pointing out the known differences in salaries and life expectancy. And the performance? Impressive on the training data, ninety-three percent accuracy. But then quite a big drop on the test data, down to sixty-six percent.

Speaker 1

08:15

Ooh. That's a nearly thirty percent drop. What does that tell us?

Speaker 2

08:18

It tells us that while the model learned the training data extremely well, almost perfectly, it really struggled to generalize that learning.

Speaker 1

08:27

So even random forest isn't immune to some overfitting. Or maybe the training data just wasn't fully representative.

Speaker 2

08:33

Could be either or both. It's a stark reminder that high training accuracy is nice, but test accuracy is what really counts for real world use.

Speaker 1

08:43

Okay, and the numeric prediction example, predicting income from the same DoctorAUS data set?

Speaker 2

08:48

Here age was the most important variable by far. Makes intuitive sense, right? People often earn more as they get older.

Speaker 1

08:53

Sure. And the numbers?

Speaker 2

08:55

Strong correlation on the training data, 0.83, but again a drop on the test data, down to 0.48.

Speaker 1

09:00

Still a decent drop. And the error, MSE?

Speaker 2

09:04

MSE was low on the training set, 0.04, and higher on the test set, 0.11.

Speaker 1

09:08

So similar story. Good at learning the training data, but only moderately good at generalizing the income prediction.

Speaker 2

09:14

Yeah, pretty much. It captures the relationship it sees, but applying it to new unseen individuals is where the real test lies.

Speaker 1

09:20

Okay, moving on: k-nearest neighbor, or kNN. This one sounds neighborly, huh?

Speaker 2

09:26

Yeah, it's actually quite intuitive. The core idea is predicting by proximity.

Speaker 1

09:31

Like your example of walking into a classroom of twelve-year-olds. If another kid walks in, you guess they're also twelve.

Speaker 2

09:37

Exactly like that. kNN looks at an unknown data point and finds the k known data points that are nearest to it in the feature space.

Speaker 1

09:45

And nearest is usually measured by?

Speaker 2

09:47

Typically Euclidean distance, just the straight-line distance between points in that multidimensional space.

Speaker 1

09:53

So K is just how many neighbors you look at, like the three nearest or five nearest?

Speaker 2

09:58

Yep, K is the number of neighbors you consider.

Speaker 1

10:01

And how do those neighbors make the prediction?

Speaker 2

10:03

For classification, they vote. That's why K is usually an odd number, to avoid ties. For numeric prediction, it's just the average of the neighbors' values.
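
The kNN recipe just described (measure Euclidean distance, take the k closest points, let them vote) fits in a few lines. The sketch below is a plain-Python illustration, not the book's code, and the points and labels are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)  # Euclidean distance to each known point
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical, already-scaled 2-D points invented for illustration.
points = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25), (0.8, 0.9), (0.9, 0.8)]
labels = ["no", "no", "no", "yes", "yes"]

print(knn_predict(points, labels, (0.2, 0.2), k=3))    # no
print(knn_predict(points, labels, (0.85, 0.85), k=3))  # yes
```

Note that nothing is "trained" here: the function just keeps the data around and does all of its work at prediction time.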

Speaker 1

10:11

What's it good for?

Speaker 2

10:12

It's pretty good with nonlinear data, where the boundary isn't a straight line. And it's nonparametric, meaning it doesn't make strong assumptions about how your data is distributed. That makes it flexible.

Speaker 1

10:22

Sounds simple enough. Any hidden traps?

Speaker 2

10:25

Well, it's sometimes called a lazy learning algorithm. Lazy because it doesn't really build a model during training. It basically just stores all the training data. The real work happens only when you ask for a prediction.

Speaker 1

10:37

Ah. So it doesn't give you much insight into why variables are important.

Speaker 2

10:41

Not really, no abstraction. And because it stores everything, it can struggle with really large data sets. It needs a lot of memory.

Speaker 1

10:50

And there was another crucial point, something about scale?

Speaker 2

10:54

Yes, critically important for kNN: it is scale sensitive.

Speaker 1

10:58

Okay, break that down. What does scale sensitive mean, practically?

Speaker 2

11:02

Imagine you have age, maybe zero to one hundred, and another variable like owns car, which is just zero or one. When kNN calculates distance, the age difference will totally swamp the owns-car difference, just because the numbers are so much bigger.

Speaker 1

11:15

So age will have way more influence on who's considered nearest.

Speaker 2

11:18

Exactly, even if owning a car is actually super important for the prediction. So you have to scale your data first.

Speaker 1

11:24

How do you scale it?

Speaker 2

11:25

Common ways are min-max scaling, where you squish everything into a zero-to-one range, or standardization, where you give variables a mean of zero and standard deviation of one. It puts everything on a level playing field.
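
Both rescaling approaches mentioned here are short formulas. The sketch below shows them in plain Python under two assumptions: the population standard deviation is used for standardization, and the input is not constant (otherwise the divisions would fail). The ages and the 0/1 car-ownership flag are hypothetical.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values linearly so the smallest is 0 and the largest is 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]  # assumes high > low

def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]  # assumes sigma > 0

ages = [20, 35, 50, 65, 80]  # hypothetical ages
owns_car = [0, 1, 1, 0, 1]   # hypothetical 0/1 flag

print(min_max_scale(ages))      # [0.0, 0.25, 0.5, 0.75, 1.0]
print(min_max_scale(owns_car))  # the flag is already in the 0-to-1 range
print(standardize(ages))
```

After min-max scaling, a 60-year age gap and a yes/no difference in car ownership both span the same 0-to-1 range, so neither dominates a distance calculation.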

Speaker 1

11:35

Field, right. So age isn't shouting louder than the other variables. Makes sense. The example used the Turnout data set to predict if someone voted.

Speaker 2

11:42

Yeah, and they specifically mentioned scaling the data first. The model got almost eighty percent accuracy on training, and importantly, it held up really well on the test data too.

Speaker 1

11:51

That consistency is good, right, suggests it's generalizing.

Speaker 2

11:54

Exactly what you want to see, not just memorizing.

Speaker 1

11:57

And for predicting income from that same Turnout data?

Speaker 2

12:00

The correlation was 0.63 on training and dropped a bit to 0.48 on test, but the MSE values were very close: 0.021 for training, 0.029 for testing.

Speaker 1

12:11

So again, even if the correlation isn't super strong, the similar error rates suggest it's performing consistently on new data.

Speaker 2

12:18

Yep, it indicates stable performance, which is often more important than hitting the absolute highest correlation number.

Speaker 1

12:23

Okay, next algorithm: support vector machines, SVM. This sounds a bit more complex.

Speaker 2

12:29

It combines ideas from kNN and linear models, but with a really clever twist. For classification, which is its main use, the goal is to find the best dividing line, or plane, or hyperplane in higher dimensions.

Speaker 1

12:41

Hyperplane, fancy word for boundary?

Speaker 2

12:43

Right. It wants the boundary that creates the biggest possible gap, or margin, between the different classes.

Speaker 1

12:50

A bigger buffer zone.

Speaker 2

12:52

Exactly. And the data points that sit right on the edge of that margin, those are the support vectors. They're the critical ones that actually define the boundary.

Speaker 1

12:59

Interesting, does it always have to be a straight line? What if the data is all mixed up?

Speaker 2

13:03

Good question. It prefers straight lines, but it has tricks. It can allow some misclassifications using a slack variable, a bit of wiggle room, or, for really messy nonlinear data, it uses the kernel trick.

Speaker 1

13:16

The kernel trick? Sounds like magic.

Speaker 2

13:19

It kind of is. It projects the data into a much higher-dimensional space where, hopefully, a simple linear boundary can separate the classes.
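
The "project into a higher dimension" intuition can be shown with an explicit feature map. This is a simplification for illustration only: a real SVM kernel gets the same effect through a kernel function without ever computing the mapping explicitly. The data, the labels, and the helper names are all hypothetical.

```python
# 1-D points: class "inner" sits between two groups of class "outer",
# so no single cut point on x separates the classes.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
labels = ["outer", "outer", "inner", "inner", "inner", "outer", "outer"]

def feature_map(x):
    """Lift a 1-D point into 2-D by adding x squared as a second coordinate."""
    return (x, x * x)

def separable_by_threshold(values, labels):
    """True if one cut point puts each class entirely on its own side."""
    ordered = [label for _, label in sorted(zip(values, labels))]
    changes = sum(a != b for a, b in zip(ordered, ordered[1:]))
    return changes == 1

lifted = [feature_map(x) for x in xs]
second_coordinate = [point[1] for point in lifted]

print(separable_by_threshold(xs, labels))                 # False
print(separable_by_threshold(second_coordinate, labels))  # True
```

In the lifted space, a horizontal line (for example, x squared equal to 2.5) is a perfectly linear boundary, even though the original one-dimensional data had none.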

Speaker 1

13:26

Like unfolding a crumpled paper to separate dots. I like that analogy. So SVMs are flexible, and can handle messy data?

Speaker 2

13:33

Very flexible, yeah, and they can be incredibly accurate even on complex problems.

Speaker 1

13:37

Okay, sounds powerful. Downside, is it another black box?

Speaker 2

13:41

Often, yes, especially with those kernel tricks and higher dimensions. Explaining why an SVM made a particular decision gets very abstract very quickly.

Speaker 1

13:49

Right, hard to explain to the boss.

Speaker 2

13:51

Can be. Also, choosing the right kernel isn't always obvious. And like kNN, it's scale sensitive. You need to rescale your data.

Speaker 1

13:59

Got it, scaling again. The example was predicting mortgage status, yes or no, from the working hours data set?

Speaker 2

14:05

Correct, and data prep was key. They combined some child-related variables and rescaled everything. Then they compared two common kernels, linear and RBF, radial basis function.

Speaker 1

14:15

How did they do?

Speaker 2

14:16

The linear kernel hit eighty seven percent accuracy on training and impressively held that exact same accuracy on the test data.

Speaker 1

14:24

Wow, perfect generalization in that case.

Speaker 2

14:26

Fantastic result. The RBF kernel was just slightly lower.

Speaker 1

14:29

And for SVM regression, predicting education level from the same data set?

Speaker 2

14:33

Yeah, again comparing linear and RBF kernels, the linear one showed a pretty weak correlation, only 0.38 on training and 0.40 on test. That doesn't sound great, but the MSE values were low and very stable: 0.01589 on training, 0.01832 on testing.

Speaker 1

14:52

So a weak relationship overall, but the model makes consistent predictions?

Speaker 2

14:57

Exactly. It suggests it's generalizing well, even if it's not explaining a huge amount of the variance. The source also mentioned outliers might be affecting the correlation metric more here.

Speaker 1

15:05

Interesting. So metrics can sometimes tell slightly different stories.

Speaker 2

15:08

Definitely, you need to look at them together.

Speaker 1

15:10

All right. Artificial neural networks, ANNs, the brain-inspired ones?

Speaker 2

15:16

Kind of. They were initially inspired by biological neurons. You have inputs, some processing happens, and you get outputs.

Speaker 1

15:22

And deep learning is just when you have lots of layers of these neurons.

Speaker 2

15:25

Multiple hidden layers. Yes, that's the essence of deep learning.

Speaker 1

15:28

So how do they actually work, simply put?

Speaker 2

15:30

Think of inputs like signals arriving. Each input gets a weight, how important it is. They get summed up and then hit an activation function.

Speaker 1

15:37

Like the neuron deciding whether to fire?

Speaker 2

15:39

Sort of, yeah. That function decides if the signal is strong enough to pass on to the next layer. Usually the information flows one way, feed-forward. It's a whole cascade of these simple weighted sums and activations.
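
The "weighted sum, then activation, then pass it forward" cascade can be sketched as follows. The sigmoid activation is an assumption chosen for illustration (the conversation does not name one), and the weights are hypothetical numbers; a real network would learn them from data.

```python
import math

def sigmoid(z):
    """A common activation function; squashes any number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through the activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)

def layer(inputs, weight_rows, biases):
    """A layer is several neurons that all read the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]

# Hypothetical weights, invented for illustration.
inputs = [0.5, -1.0]
hidden = layer(inputs, [[0.8, 0.2], [-0.4, 0.9]], [0.0, 0.1])  # two hidden neurons
output = layer(hidden, [[1.5, -2.0]], [0.3])                   # one output neuron

print(hidden)
print(output)
```

Feeding the hidden layer's outputs into the next layer is the feed-forward flow described above; stacking more such layers is what makes a network "deep".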

Speaker 1

15:51

And the big advantage? Why all the hype?

Speaker 2

15:54

They really shine with massive amounts of data. Given enough data, they can learn incredibly complex, subtle patterns that other algorithms might miss entirely.

Speaker 1

16:03

So flexibility is huge. Image recognition, self-driving cars.

Speaker 2

16:07

Exactly, they power a lot of cutting-edge AI tasks.

Speaker 1

16:11

But the catch: they need tons of data?

Speaker 2

16:13

Typically, yes, massive data sets for optimal performance, and training them can take a lot of computing power and time. Plus sometimes simpler networks can struggle to converge, meaning they don't actually learn effectively.

Speaker 1

16:25

Okay, example time: predicting union membership from the wages data set.

Speaker 2

16:29

Right, and a key step here was turning categorical variables like occupation into dummy variables.

Speaker 1

16:35

Numerical flags, basically.

Speaker 2

16:37

Yep, zeros and ones the network can understand. The model achieved a solid seventy percent accuracy, and importantly, it was consistent between the training and test data.
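
Turning a categorical variable into dummy variables, as described for occupation, can be sketched like this. The helper name and the occupation values are hypothetical, and the categories are simply sorted alphabetically to fix the column order.

```python
def one_hot(values):
    """Turn a list of category names into rows of 0/1 dummy variables."""
    categories = sorted(set(values))  # one column per distinct category
    rows = [[1 if value == c else 0 for c in categories] for value in values]
    return categories, rows

# Hypothetical occupation values, invented for illustration.
occupations = ["sales", "clerical", "management", "sales"]
categories, rows = one_hot(occupations)

print(categories)  # ['clerical', 'management', 'sales']
for row in rows:
    print(row)     # exactly one 1 per row
```

Each row ends up with a single 1 marking its category, which gives the network numbers to work with without implying any ordering between occupations.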

Speaker 1

16:46

Good stability again. And the regression example, predicting wages?

Speaker 2

16:50

Also from the wages data set. It showed a moderate correlation, 0.56 on training and a very close 0.54 on test, and the MSE values were also very close between the two sets.

Speaker 1

17:00

So reasonably good generalization there too. Consistent, if not spectacular, prediction.

Speaker 2

17:06

Looks like it. Reliable performance on new data.

Speaker 1

17:09

Now for something completely different: k-means. You said this one's unsupervised learning?

Speaker 2

17:13

That's right. This is fascinating because, unlike everything else we've discussed, there's no right answer or target variable we're trying to predict.

Speaker 1

17:20

No gender, no income, no voted label.

Speaker 2

17:24

Exactly. K-means isn't trying to predict anything specific. It's just trying to find natural groupings, or clusters, within the data itself, based on similarity.

Speaker 1

17:32

How does it find these groups?

Speaker 2

17:33

It starts by randomly guessing the locations of K cluster centers, or centroids.

Speaker 1

17:38

K being the number of clusters you think are in the data?

Speaker 2

17:41

Precisely. Then it assigns each data point to its nearest centroid. After that, it recalculates the position of each centroid to be the actual center of all the points assigned to it.

Speaker 1

17:52

And it repeats that? Assign points, move centers?

Speaker 2

17:55

Yep. It iterates back and forth, assign points, update centroids, until the centroids stop moving much, until things stabilize.
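
The assign-then-update loop just described is short enough to sketch in plain Python. This is an illustrative version, not the book's implementation: the stopping rule (stop when no centroid moves at all), the handling of empty clusters, and the sample points are all assumptions made for the example.

```python
import math
import random

def k_means(points, k, rng, max_iterations=100):
    """Plain k-means: assign points to the nearest centroid, then move centroids."""
    centroids = rng.sample(points, k)  # random initial guesses
    for _ in range(max_iterations):
        # Assignment step: index of the nearest centroid for every point.
        assignments = [
            min(range(k), key=lambda i: math.dist(p, centroids[i])) for p in points
        ]
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = []
        for i in range(k):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:  # nothing moved, so things have stabilized
            break
        centroids = new_centroids
    return centroids, assignments

# Hypothetical, already-scaled 2-D points forming two obvious groups.
points = [(0.0, 0.1), (0.1, 0.0), (0.1, 0.1), (0.9, 1.0), (1.0, 0.9), (1.0, 1.0)]
centroids, assignments = k_means(points, k=2, rng=random.Random(0))
print(centroids)
print(assignments)
```

Because the starting centroids are random, different seeds can label the clusters in a different order, which is one reason cluster numbers on their own carry no meaning.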

Speaker 1

18:02

So the researcher has to decide on K, the number of clusters? That sounds tricky.

Speaker 2

18:06

It can be. It's a key challenge.

Speaker 1

18:07

So what's the big benefit of doing this?

Speaker 2

18:09

It's fantastic for exploring your data, for segmentation, finding hidden patterns you didn't even know existed, understanding what makes different subgroups within your data distinct.

Speaker 1

18:18

Discovering natural segments.

Speaker 2

18:21

Exactly. But the drawbacks stem from that unsupervised nature. Since there's no right answer, evaluating how good the clusters are is more subjective, relying on the researcher's interpretation.

Speaker 1

18:31

Let me guess: scale sensitive?

Speaker 2

18:34

You got it. It requires data normalization or scaling, just like kNN and SVM, because it relies on distance calculations. And yeah, choosing the right K is tough.

Speaker 1

18:44

Are there ways to help choose K?

Speaker 2

18:46

There are methods, yeah, like the elbow method, where you plot a measure of cluster cohesion against different values of K and look for an elbow point where adding more clusters doesn't improve things much.
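
As a rough sketch of the elbow method, the code below uses the within-cluster sum of squares as the cohesion measure and a tiny one-dimensional k-means with evenly spread starting centroids. Both choices, along with the data, are assumptions made to keep the example short and deterministic.

```python
def kmeans_1d(values, k, iterations=50):
    """Tiny 1-D k-means with evenly spread starting centroids (an assumption)."""
    ordered = sorted(values)
    # Start centroids at evenly spaced positions in the sorted data.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters

def within_cluster_ss(centroids, clusters):
    """Cohesion measure: total squared distance from points to their centroid."""
    return sum((v - m) ** 2 for m, c in zip(centroids, clusters) for v in c)

# Hypothetical 1-D data with two clear groups.
values = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.9, 5.1]

wss_by_k = {}
for k in range(1, 5):
    centroids, clusters = kmeans_1d(values, k)
    wss_by_k[k] = within_cluster_ss(centroids, clusters)
    print(k, round(wss_by_k[k], 4))
```

The measure falls sharply from one cluster to two and then barely changes, and that bend is the "elbow" a researcher would look for when plotting these values against K.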

Speaker 1

18:56

Okay, the example used ACT and SAT scores, age, and education?

Speaker 2

19:03

Yes, and they stressed normalizing the data first, because scores, age and education level are all on different scales.

Speaker 1

19:10

Did the elbow method work?

Speaker 2

19:12

It suggested K equals two, two clusters, but the author actually decided to go with K equals three, looking for maybe a bit more detail in the groupings.

Speaker 1

19:18

And what did those three clusters reveal?

Speaker 2

19:21

Cluster zero had the oldest students, most education, highest test scores generally. Cluster one was kind of medium across the board, second oldest, second highest education. And cluster two was much younger, with weaker test performance.

Speaker 1

19:33

Seems logical age and achievement grouping together.

Speaker 2

19:36

Yeah. But here's the kicker, the really surprising part. When they looked closer, what did they find? A really clear separation by gender emerged: cluster one all males, clusters zero and two all females.

Speaker 1

19:49

Whoa. The algorithm wasn't told about gender, right? It just found that pattern?

Speaker 2

19:54

Exactly. It revealed this underlying structure, and it helped explain things like why cluster one, all males, had particularly high quantitative SAT scores, tying into statistical trends.

Speaker 1

20:05

That's amazing. So k-means didn't just group by scores. It uncovered this fundamental demographic split. What does that tell us, generally?

Speaker 2

20:13

It shows how these unsupervised methods can reveal really deep insights, sometimes biases or strong correlations we weren't even looking for. It forces you to ask why the data is structured that way.

Speaker 1

20:24

So the final interpretation was?

Speaker 2

20:26

Cluster zero, older educated women; cluster one, young males; cluster two, very young women. A much richer picture than just high, medium, low scores.

Speaker 1

20:34

Incredible. Okay, so we've seen all these algorithms, but just getting an answer isn't enough. We need to know if the answer is good.

Speaker 2

20:40

Assessing models, absolutely critical. Otherwise you're just generating numbers without knowing if they're meaningful.

Speaker 1

20:45

And for classification, the go to tool is the confusion matrix.

Speaker 2

20:49

Definitely a fundamental starting point. It just lays out your predictions against the actual truth in a simple table.

Speaker 1

20:55

Remind us of the four boxes.

Speaker 2

20:57

You've got true negatives, TN, correctly saying something isn't there. False negatives, FN, missing something that is there. Big problem in medical tests, for instance.

Speaker 1

21:05

Right. False positives?

Speaker 2

21:07

FP, saying something is there when it isn't, like a spam filter blocking a real email.

Speaker 1

21:12

And true positives, TP?

Speaker 2

21:14

Correctly identifying something that is there, the spam filter catching actual spam.

Speaker 1

21:18

So the cancer decision tree example had 4 TN, 0 FN, 28 FP, and 84 TP. Lots of numbers. If I'm building, say, a system to detect faulty products on an assembly line, which metrics should I care most about?

Speaker 2

21:32

Ooh, good question. For faulty products, you probably care a lot about recall, also called sensitivity. That's true positives divided by all the actual positives, TP plus FN. You want to catch as many faulty products as possible, even if you accidentally flag a few good ones. Missing a faulty one could be costly or dangerous.

Speaker 1

21:49

So recall is about minimizing misses. For the cancer example, recall was 0.76?

Speaker 2

21:55

Yes. And precision, which is true positives divided by all the ones the model predicted as positive, TP plus FP, was also 0.76. Precision is about how trustworthy a positive prediction is: when the model says it's cancer, how often is it right? And the F-measure, that just combines precision and recall into one score, also 0.76 in this case because there were zero false negatives, which is unusual.

Speaker 1

22:19

What about the negatives? Specificity?

Speaker 2

22:21

Specificity is true negatives divided by all the actual negatives, TN plus FP. It measures how well the model identifies the true negatives. For the cancer example, it was only 0.13, quite poor, meaning it wasn't good at correctly identifying healthy people as healthy.
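
The formulas for recall, precision, and specificity given above translate directly into code. In the sketch below the counts are hypothetical numbers for an imagined faulty-product detector, and the F-measure is assumed to be the usual F1 score, the harmonic mean of precision and recall.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute common metrics from the four confusion-matrix counts."""
    recall = tp / (tp + fn)       # sensitivity: share of actual positives caught
    precision = tp / (tp + fp)    # share of positive predictions that are right
    specificity = tn / (tn + fp)  # share of actual negatives correctly cleared
    f_measure = 2 * precision * recall / (precision + recall)  # F1 (assumed)
    return {"recall": recall, "precision": precision,
            "specificity": specificity, "f_measure": f_measure}

# Hypothetical counts for a faulty-product detector, invented for illustration.
metrics = classification_metrics(tp=45, fp=15, tn=135, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Here the detector catches 90 percent of the faulty items (recall) while a quarter of its alarms are false (precision of 0.75), which is the kind of trade-off the assembly-line question is about.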

Speaker 1

22:35

Okay, so different metrics matter depending on the goal. What about plain old accuracy?

Speaker 2

22:40

Accuracy is just TN plus TP divided by the total, all the correct predictions. It was seventy-six percent here. But accuracy can be misleading, especially if one class is way more common than the other.

Speaker 1

22:50

Like if ninety-nine percent of emails aren't spam, a model predicting not spam all the time gets ninety-nine percent accuracy but is useless.

Speaker 2

23:00

Exactly. That's why you need these other metrics. And error is just one minus accuracy, so twenty-four percent here.

Speaker 1

23:05

And finally, kappa?

Speaker 2

23:06

Kappa measures accuracy but accounts for how much agreement you'd expect just by chance. Closer to one is better. Here it was 0.171, which is pretty low. It suggests the model's performance wasn't much better than random guessing once you factor chance in.

Speaker 1

23:21

Wow. Okay, so looking at all of them gives a much fuller picture.

Speaker 2

23:24

Definitely.
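
The accuracy, error, and kappa figures quoted for the cancer decision tree can be reproduced from its four confusion-matrix counts (4 true negatives, 0 false negatives, 28 false positives, 84 true positives). The sketch below assumes kappa means Cohen's kappa; the function name is invented for illustration.

```python
def accuracy_and_kappa(tn, fn, fp, tp):
    """Return (accuracy, Cohen's kappa) for a 2x2 confusion matrix."""
    total = tn + fn + fp + tp
    observed = (tn + tp) / total  # plain accuracy
    # Agreement expected by chance, from the actual and predicted class totals.
    expected = ((tn + fp) * (tn + fn) + (tp + fn) * (tp + fp)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

# Counts quoted in the conversation for the cancer decision tree.
accuracy, kappa = accuracy_and_kappa(tn=4, fn=0, fp=28, tp=84)
print(f"accuracy: {accuracy:.2f}")      # accuracy: 0.76
print(f"error:    {1 - accuracy:.2f}")  # error:    0.24
print(f"kappa:    {kappa:.3f}")         # kappa:    0.171
```

The chance-expected agreement is already about 0.71, because the model predicts the majority class almost every time, so an accuracy of 0.76 leaves very little credit for the model itself.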

Speaker 1

23:25

Now the ROC curve. This has a cool backstory: World War II?

Speaker 2

23:29

Yeah. Radar engineers used it to figure out if a blip was an enemy plane, a true positive, or just noise, a false positive. High-stakes stuff.

Speaker 1

23:37

And now we use it for models?

Speaker 2

23:39

Yep. It plots the true positive rate against the false positive rate at various thresholds. Ideally, you want the curve to shoot up towards the top left corner high true positives, low false positives.

Speaker 1

23:50

A diagonal line is bad, like random guessing?

Speaker 2

23:52

Exactly. The area under this curve, AUC, gives a single number, zero to one. The cancer model had an AUC of 0.79, which is generally considered acceptable, maybe fairly good.
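
To show how an ROC curve and its AUC come out of a set of predictions, here is a small sketch that sweeps the threshold over every distinct score and integrates with the trapezoid rule. The scores and labels are hypothetical, and the function names are invented for illustration.

```python
def roc_points(scores, labels):
    """(false positive rate, true positive rate) at each distinct threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(p and y == 1 for p, y in zip(predicted, labels))
        fp = sum(p and y == 0 for p, y in zip(predicted, labels))
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoid rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical predicted probabilities and true labels (1 = positive).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]

curve = roc_points(scores, labels)
print(curve)
print(auc(curve))
```

Lowering the threshold moves along the curve from the bottom left to the top right; a model that ranks every positive above every negative would hug the top left corner and score an AUC of one.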

Speaker 1

24:03

Okay, another key technique: cross-validation. This is for checking generalizability?

Speaker 2

24:08

Right. Instead of just one train-test split, you divide your training data into, say, five or ten folds, like slices. Then, with five folds, you train the model five times. Each time, you train on four folds and test on the one fold left out.

Speaker 1

24:20

Using a different slice for testing each time?

Speaker 2

24:23

Exactly. Then you average the results from those five tests. It gives you a much better idea of how stable the performance is and how well it's likely to perform on genuinely new data.
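
The fold-by-fold procedure can be sketched without any library. In the example below the "model" is a deliberately trivial stand-in that always predicts the most common training label, and the labels are hypothetical; the point is only the splitting and averaging.

```python
from statistics import mean, stdev

def k_fold_indices(n_rows, k):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_rows))
    folds = [indices[i::k] for i in range(k)]  # deal rows into k slices
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

def majority_class_accuracy(labels, train, test):
    """Stand-in 'model': always predict the most common training label."""
    train_labels = [labels[j] for j in train]
    guess = max(set(train_labels), key=train_labels.count)
    return mean([1 if labels[j] == guess else 0 for j in test])

# Hypothetical labels, invented for illustration.
labels = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
scores = [majority_class_accuracy(labels, train, test)
          for train, test in k_fold_indices(len(labels), k=5)]

print(scores)                       # one accuracy per fold
print(mean(scores), stdev(scores))  # average and fold-to-fold spread
```

Reporting the standard deviation alongside the mean shows how much the score depends on which slice happened to be held out.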

Speaker 1

24:32

And the real test set, the one you held back at the start.

Speaker 2

24:35

Crucially, you only touch that once, right at the very end, after you've done all your model selection and tuning using cross-validation on the training data. Don't peek.

Speaker 1

24:43

Got it. For the cancer decision tree, cross-validation showed about seventy-three percent accuracy, but with a standard deviation of eleven percent.

Speaker 2

24:52

Yeah, that standard deviation tells you there was quite a bit of variability in performance across the different folds, maybe due to the small sample size making the folds quite different from each other.

Speaker 1

25:01

Okay, shifting to assessing regression models predicting numbers. No confusion matrix here, right.

Speaker 2

25:07

Right, we use different metrics. You'd start by comparing basic dats, mean standard deviation, core tiles of the actual values versus your predicted values. Ideally they should look pretty similar.

Speaker 1

25:17

Like for the age prediction the means were close, but standard deviations.

Speaker 2

25:21

Differed a bit. Yeah, exactly. Then you look at the correlation between actual and predicted. For age, that was point five to four on the training set.

Speaker 1

25:29

We also saw mean squared error, MSE.

Speaker 2

25:31

Lower is better. Yes, for age, it was sixty one point eight on train, eighty five point two four on test. The difference shows that drop in performance on unseen data. And R squared: R squared tells you how much of the variation in the outcome variable your model explains. It ranges from zero to one. For age prediction, that was point two nine, which the source called not exciting. It means the model only explained about twenty nine percent of the variation in age.
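
For reference, the two regression metrics just mentioned can be written out in a few lines of plain Python. This is a minimal sketch using the standard definitions; the function names and the four example ages are invented for illustration and are not the book's data. Note that with this formula R squared can dip below zero on unseen data when a model does worse than simply predicting the mean.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """1 minus (unexplained variation / total variation around the mean)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Illustrative values only.
actual_ages = [30, 40, 50, 60]
predicted_ages = [35, 40, 45, 60]

example_mse = mean_squared_error(actual_ages, predicted_ages)
example_r2 = r_squared(actual_ages, predicted_ages)
print(example_mse, example_r2)
```

Predicting the mean of the actual values for every case gives an R squared of zero under this definition, which is a handy baseline for reading a value like point two nine.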

Speaker 1

25:56

And cross validation applies here too.

Speaker 2

25:58

Absolutely, you'd cross-validate metrics like MSE and R squared. For that SVM regression predicting education, the cross-validated MSE was similar to the original test MSE, which is good. But the cross-validated R squared had a high standard deviation, point one nine. So I guess the model's ability to explain the variance wasn't very stable across different subsets of the data. Again, maybe sample size issues or just a complex relationship.

Speaker 1

26:21

Okay, we've built models, we've assessed them. Now, making them better. We know we can get more data, change variables, switch algorithms. But what about tuning?

Speaker 2

26:31

Ah, hyperparameter tuning. This is really interesting. It's about tweaking the settings of the algorithm itself before it starts learning.

Speaker 1

26:39

So not changing the data, but changing how the algorithm learns from the data. Exactly.

Speaker 2

26:44

Things like the K in kNN, or the C penalty term in SVM, or how deep you let a decision tree grow. These are hyperparameters.

Speaker 1

26:52

And how do you tune them?

Speaker 2

26:54

Guessing? More systematic than that. Usually you define a grid of possible values for the hyperparameters you want to tune, like try K values from one to twenty, try different distance metrics.

Speaker 1

27:04

And the computer tries all the combinations?

Speaker 2

27:06

Yep, often using cross-validation within the grid search. It runs models for all combinations and tells you which set of hyperparameters gave the best average performance on the cross-validation folds. It's a bit of trial and error, guided by data.
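
The "try every combination" loop can be sketched in plain Python with `itertools.product`. This is only an illustration under stated assumptions: `grid_search`, the grid values, and especially `toy_cv_score` are hypothetical. In a real run the scoring function would train the model with the given settings and return its mean cross-validated score; here it is a made-up formula so the example runs on its own.

```python
from itertools import product


def grid_search(param_grid, cv_score):
    """Score every combination in the grid and keep the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    n_tried = 0
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cv_score(params)
        n_tried += 1
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, n_tried


def toy_cv_score(params):
    """Hypothetical stand-in for a mean cross-validated accuracy."""
    bonus = 0.02 if params["metric"] == "manhattan" else 0.0
    return 0.70 - 0.001 * (params["k"] - 7) ** 2 + bonus


# Illustrative grid: K from one to twenty, two distance metrics.
example_grid = {"k": list(range(1, 21)), "metric": ["manhattan", "euclidean"]}
best_params, best_score, n_tried = grid_search(example_grid, toy_cv_score)
print(best_params, round(best_score, 3), n_tried)
```

The number of models grows as the product of the list lengths (here 20 times 2), which is why even a modest grid can mean dozens of fits, each repeated once per fold.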

Speaker 1

27:19

Let's look at the kNN tuning example. The original model, predicting voted, was around seventy nine percent accurate.

Speaker 2

27:25

Right, they tuned K one thirteen, the weights (uniform versus distance-based), and the distance metric (Manhattan versus Minkowski versus Euclidean). That created ninety six different kNN models to test.

Speaker 1

27:36

Wow, and the best one?

Speaker 2

27:38

The best combination found through cross-validation achieved seventy four percent accuracy.

Speaker 1

27:43

That's lower than the original seventy nine percent.

Speaker 2

27:45

On cross-validation, yes. But here's the crucial part. When they tested that tuned model back on its own training data, its accuracy jumped to ninety nine point seven percent.

Speaker 1

27:55

Whoa, nearly perfect. But that sounds suspicious.

Speaker 2

27:59

Really suspicious. That huge gap between the near-perfect training accuracy and the seventy four percent cross-validated accuracy screams overfitting. The tuning process found settings that basically memorized the training data.

Speaker 1

28:11

So this is a massive warning sign about just looking at training accuracy, especially after tuning.

Speaker 2

28:15

Absolutely. It reinforces why cross-validation and that final untouched test set are non-negotiable.

Speaker 1

28:20

Okay, and the SVM regression tuning, predicting education? The original MSE was about point zero one five eight nine.

Speaker 2

28:26

They tuned the C cost parameter, the kernel type (linear versus RBF), and the degree for polynomial kernels. And the result? The best combo nudged the MSE down slightly, to point zero one five three eight.

Speaker 1

28:37

A tiny improvement. Worth it?

Speaker 2

28:39

Depends entirely on the context. Sometimes shaving even a tiny bit off the error can be hugely valuable. It shows the potential, even if it's not always dramatic.

Speaker 1

28:48

Finally, taking it one step further, combining algorithms: the ensemble approach, or stacking.

Speaker 2

28:54

Yes, stacking is a pretty sophisticated way to combine models. The basic idea is you train several different models, like a random forest, an SVM, a kNN. Okay. Then instead of just averaging their outputs, you use their predictions as inputs for a final meta-model.

Speaker 1

29:10

So a model that learns from the predictions of other models.

Speaker 2

29:12

Exactly, it learns how to best combine the strengths and potentially correct the weaknesses of the base models to make a final, hopefully better prediction.

Speaker 1

29:20

Trying to create a supermodel.

Speaker 2

29:21

That's the goal: maximize strengths, minimize weaknesses.

Speaker 1

29:24

The example used the doctor Aus data set again, predicting gender, combining random forest, SVM, and kNN.

Speaker 2

29:33

Yep. Individually, cross-validation showed random forest was the best performer on its own, while SVM struggled a bit.

Speaker 1

29:37

So did stacking them beat the random forest?

Speaker 2

29:41

Interestingly, no. The initial stacked model actually performed worse than just using the random forest by itself.

Speaker 1

29:48

Huh. So more complex isn't always better?

Speaker 2

29:51

Definitely not automatically. Just throwing models together can sometimes add noise or compound errors.

Speaker 1

29:56

Did they try tuning the models within the stack?

Speaker 2

30:00

They did. They tuned hyperparameters for the random forest, the SVM, and the kNN within the ensemble structure.

Speaker 1

30:06

And the final result, after all that work?

Speaker 2

30:09

The final tuned ensemble model landed at about seventy percent accuracy on both training and test data.

Speaker 1

30:15

So after all that stacking and tuning, it didn't really improve much, if at all, over the simpler single random forest model.

Speaker 2

30:23

Pretty much. In this specific case, the added complexity didn't yield significantly better performance.

Speaker 1

30:28

That's a really important takeaway. A huge one. Wow. Okay, we've covered a lot of ground today, from decision trees, kind of the building block,

Speaker 2

30:34

To random forests, using the wisdom of the crowd.

Speaker 1

30:36

kNN using proximity, SVM finding that optimal boundary, ANNs sort of mimicking the brain.

Speaker 2

30:44

Then k-means, finding hidden groups without even being told what to look for.

Speaker 1

30:48

And crucially, how to actually measure if these models are any good, using things like the confusion matrix, ROC curves, cross-validation.

Speaker 2

30:57

And how to potentially improve them through careful hyperparameter tuning and even stacking them together.

Speaker 1

31:02

And remember, listener, these aren't just, you know, abstract ideas. This deep dive should give you a real foundation for understanding how these tools work.

Speaker 2

31:10

Yeah, their quirks, their strengths, their weaknesses, and importantly, how to think critically about their results and choose the right approach for your specific problem.

Speaker 1

31:19

And that final point from the ensemble example really sticks with me. As the source material concludes, and this is powerful: complexity is not a cure-all for better performance, and simple is almost always better.

Speaker 2

31:32

It's a fantastic lesson, it really is.

Speaker 1

31:33

And it leaves us with a question for you to maybe mull over in whatever you're working on, whether it's data or just life: where might you be overcomplicating things? Where could a simpler approach actually lead to clearer, more powerful insights?

Speaker 2

31:47

Something to think about.

Transcript source: Provided by creator in RSS feed
