Data Science from Scratch: First Principles with Python

Speaker 1

00:00

Imagine you're managing a nationwide campaign, right, Okay, you have this massive database, like a million potential voters, but you only have the time and the budget to knock on I don't know, ten thousand.

Speaker 2

00:13

Doors, right, extremely limited resources.

Speaker 1

00:15

Exactly. So how do you mathematically guarantee that you're knocking on the exact right ones? Or take a massive retailer like Target.

Speaker 2

00:23

Oh yeah, the famous pregnancy example.

Speaker 1

00:25

Yes, how do they know a customer is pregnant and can start sending them coupons for baby clothes before that person has even told their own family.

Speaker 2

00:34

I mean, it sounds like corporate espionage or literal mind reading, it really does. But it's actually just recognizing patterns in, you know, seemingly mundane data, like unscented lotion purchases suddenly correlating with a second trimester, or, in the case of the twenty twelve Obama campaign, finding the exact combination of demographic data and past voting behavior that signals that a person just needs a very slight nudge to actually show up at the polls.

Speaker 1

01:00

Which is wild, right? Because we see the end results of these models every single day. You tap a glossy button on a smartphone, a little progress bar spins, and an app confidently tells you who to date, what to buy, what movie to watch next.

Speaker 2

01:14

Yeah, it's curated reality exactly.

Speaker 1

01:16

There's this expectation of seamless magic. We like things to be hidden behind a sleek interface.

Speaker 2

01:22

Which is exactly why so many people get a massive shock when they actually step into the discipline of data science itself, because that glossy interface is entirely stripped away. You are suddenly looking at a landscape that is raw, messy, and honestly totally unforgiving. And a concerning number of practitioners today are overly reliant on shiny, prepackaged libraries.

Speaker 1

01:46

The black boxes.

Speaker 2

01:48

Exactly. They plug data into a black box, Yeah, and they just trust whatever comes out, without having any idea how the underlying engine actually functions.

Speaker 1

01:55

Well, that is the exact trap we are dismantling today. Welcome to our deep dive into Joel Grus's book, Data Science from Scratch: First Principles with Python.

Speaker 2

02:03

It's a fantastic text.

Speaker 1

02:05

It really is. And our mission for this deep dive is to basically throw out the automated frameworks. No massive black box libraries doing the heavy lifting for us. We are going to look at the raw mechanics.

Speaker 2

02:18

Right down to the studs.

Speaker 1

02:20

Exactly. By the end of our conversation, you are going to understand how fundamental algorithms process information. We're moving from basic data structures into the mechanics of visualization, all the way down to the bare metal of linear algebra.

Speaker 2

02:34

Because data science isn't about memorizing syntax to import a machine learning library, right, It's about answering questions that no one has even thought to ask yet. Yeah, and those answers are buried in this massive glut of everyday information. You can only extract them if you fundamentally understand the tools you're holding.

Speaker 1

02:52

Okay, let's unpack this, because before we look at a single line of actual code, the book sets up this brilliant hypothetical sandbox for us to play in: the startup. You've just been hired at DataSciencester, which is a fictional social network built exclusively for data scientists.

Speaker 2

03:08

I mean, it's a great framing device. So on your very first day, the VP of networking drops a massive data dump on your desk, as they do, and gives you your first assignment: identify the key connectors among the users. But this data isn't sitting in some neat, easily searchable database.

Speaker 1

03:27

No it's not.

Speaker 2

03:28

It is raw Python data structures. You get a list of users where each person is represented by just a simple dictionary.

Speaker 1

03:35

Right. So I'm looking at this list in the book, and it's just basic pairs. User ID zero is named Hero, user ID one is Dunn. Just raw text. And alongside that you get the friendship data. But it's not a visual web of connections. It's just a list of tuples.

Speaker 2

03:50

Exactly. And a tuple is just an immutable, or unchangeable, sequence of elements. In this case, it's just a pair of IDs. So if Hero is friends with Dunn, the data just shows the tuple (0, 1). That is your entire social graph. It's entirely abstract.

Speaker 1

04:05

So if I'm tasked with finding the most important person in this abstract graph, my immediate instinct is to just, I don't know, count who has the most tuples with their ID in it. Just do a raw headcount.

Speaker 2

04:17

And that is the concept of degree centrality. You're simply asking who has the most direct connections, you sum them up, sort the list, and whoever is at the top is theoretically your most central figure.
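A minimal sketch of that raw headcount, assuming users are stored as dictionaries and friendships as ID pairs, as described above. The names and pairs here are illustrative sample data chosen to match the counts mentioned in this conversation (three friends for Dunn, two for Thor); the variable names are assumptions, not the book's exact code.

```python
# Illustrative sample data: each user is a dict, each friendship a pair of IDs.
users = [
    {"id": 0, "name": "Hero"}, {"id": 1, "name": "Dunn"},
    {"id": 2, "name": "Sue"}, {"id": 3, "name": "Chi"},
    {"id": 4, "name": "Thor"}, {"id": 5, "name": "Clive"},
    {"id": 6, "name": "Hicks"}, {"id": 7, "name": "Devin"},
    {"id": 8, "name": "Kate"}, {"id": 9, "name": "Klein"},
]
friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
               (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]

# Degree centrality: count how many friendship tuples mention each ID.
degree = {user["id"]: 0 for user in users}
for i, j in friendships:
    degree[i] += 1
    degree[j] += 1

# Sort users from most to fewest direct connections.
ranked = sorted(users, key=lambda user: degree[user["id"]], reverse=True)

for user in ranked:
    print(user["name"], degree[user["id"]])
```

With this sample data Dunn ends up with a count of three and Thor with two, which is exactly the ranking the hosts go on to question.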

Speaker 1

04:28

But I'm looking at that and thinking, that's just high school popularity. It's measuring who sits with the most people at lunch. But does having a high raw headcount actually make you the most critical node in a flow of information?

Speaker 2

04:41

It rarely does, and Grus highlights this exact flaw by introducing an anomaly in the DataSciencester network.

Speaker 1

04:47

Right, the Dunn and Thor situation.

Speaker 2

04:50

Exactly. So, if you look at the raw numbers, the user Dunn, who is ID one, has three direct friends, but another user, Thor, ID four, only has two. So if you blindly run a degree centrality algorithm, it ranks Dunn as more important than Thor.

Speaker 1

05:05

Wait, but if I actually map out these connections visually, Thor is sitting dead in the middle of a chasm. He's the only link bridging two completely separate, isolated clusters of users.

Speaker 2

05:16

Precisely. If Dunn leaves the network, his three friends can still talk to each other through other paths. But if Thor leaves the network, it literally breaks in half. The flow of information stops entirely. Thor is a bottleneck. Intuitively, that makes him far more critical to the network than Dunn, despite having fewer direct friends.
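One way to check the bottleneck claim is to delete a user and count how many disconnected groups remain. This is a hedged sketch, not a method taken from the book: the helper name `groups_without` and the sample friendship pairs are assumptions for illustration.

```python
# Illustrative friendship pairs (same shape as described earlier: tuples of user IDs).
friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
               (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]

def groups_without(removed, edges, num_users=10):
    """Count the connected groups that remain after deleting one user."""
    neighbors = {u: set() for u in range(num_users) if u != removed}
    for i, j in edges:
        if removed not in (i, j):
            neighbors[i].add(j)
            neighbors[j].add(i)

    seen, groups = set(), 0
    for start in neighbors:
        if start in seen:
            continue
        groups += 1                      # found a user not reachable so far
        stack = [start]
        while stack:                     # depth-first walk of this group
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(neighbors[node] - seen)
    return groups

without_dunn = groups_without(1, friendships)  # network stays in one piece
without_thor = groups_without(4, friendships)  # network splits in two
```

On this sample graph, removing user 1 leaves a single group, while removing user 4 leaves two, which matches the intuition that the lower-degree user is the more critical one.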

Speaker 1

05:36

So if I just used a prepackaged function that calculates popularity, I would have handed my boss the wrong name on day one.

Speaker 2

05:42

You would have. And that proves why you can't blindly trust a default metric. You have to look at the structure of the data. You have to build functions that look at, say, mutual friends or shared interests. And to build those functions you have to use the language, in this case Python. But you don't just write Python. The book stresses writing Pythonic code.

Speaker 1

06:00

Which is a whole different mindset. And I noticed the book adheres specifically to Python two point seven, which introduces some serious quirks into how we handle data.

Speaker 2

06:09

Oh the division quirk.

Speaker 1

06:11

Yes, if I open Python two point seven and type five divided by two, it doesn't give me two point five. It spits out two.

Speaker 2

06:17

Because of how it handles integer division, it truncates the decimal entirely unless you explicitly tell the environment to use floating point math.

Speaker 1

06:26

Which seems so counterintuitive.

Speaker 2

06:29

It is. You literally have to type from __future__ import division to make basic math behave the way a human expects it to. I mean, it's a harsh reminder that the underlying environment shapes the reality of your data.
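A short illustration of the two kinds of division. The snippet below assumes Python 3, where `/` is already true division; under Python 2.7 the first line would only give 2.5 after `from __future__ import division` at the top of the file.

```python
# Python 3 semantics. In Python 2.7, `5 / 2` yields 2 unless the module
# begins with `from __future__ import division`.
true_quotient = 5 / 2      # true division keeps the fractional part
floor_quotient = 5 // 2    # floor division discards it
negative_floor = -5 // 2   # floors toward negative infinity, not toward zero

print(true_quotient, floor_quotient, negative_floor)
```

The negative case is worth noticing: integer division in Python rounds down rather than simply chopping off the decimal.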

Speaker 1

06:41

That feels incredibly rigid, but honestly not as rigid as whitespace formatting. I'm used to seeing code wrapped in curly braces or having explicit end statements, but Python relies entirely on indentation.

Speaker 2

06:52

Right. It forces you to write readable code. There's no visual clutter. But the trade off is that it's strictly enforced.

Speaker 1

06:59

Meaning if you mess up a space.

Speaker 2

07:01

If you're copying and pasting a block of logic to analyze, say, user salaries, and you have one accidental space, the entire script crashes. It forces discipline, but the reward for that discipline is access to incredibly elegant tools right out of the box, things like defaultdict and Counter.

Speaker 1

07:19

Okay, let's start there, because I want to understand the mechanics of why these are so vital. Let's say my next task at DataSciencester is to figure out what our users care about.

Speaker 2

07:29

Okay, finding their interests?

Speaker 1

07:31

Right, I want to count how many times specific words show up in their profile bios, words like Hadoop or scikit-learn. If I use a standard Python dictionary to keep a running tally, how does that actually execute?

Speaker 2

07:44

It's incredibly clunky. A standard dictionary raises a KeyError if you try to increment a key that doesn't exist yet. So as your program reads the word Hadoop for the very first time, you have to write logic that says: check if Hadoop is in the dictionary. It isn't. Okay, create the key Hadoop and set the value to one.

Speaker 1

08:01

That sounds exhausting.

Speaker 2

08:03

It is. Oh look, the next word is hadoop again. Check if it exists, yes, okay, increment the value by one.

Speaker 1

08:10

You are writing multiple lines of repetitive safety checks just to count words. That seems like a massive waste of processing time and just human effort.

Speaker 2

08:18

It totally is. Enter defaultdict. It intercepts that missing key error. If you ask it to modify Hadoop and Hadoop isn't there, it gracefully creates it on the fly, assigns it a default value like zero, and then lets your code increment it. It removes all the boilerplate safety checks.

Speaker 1

08:37

That is so much cleaner, and then Counter takes that a step further.
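The contrast just described can be sketched as follows. The word list is made-up sample data, and the variable names are assumptions for illustration.

```python
from collections import defaultdict

# Made-up sample of words pulled from profile bios.
words = ["hadoop", "python", "hadoop", "scikit-learn", "hadoop"]

# Plain dict: every update needs an explicit "does this key exist yet?" check,
# because `counts[word] += 1` on a missing key raises KeyError.
manual_counts = {}
for word in words:
    if word in manual_counts:
        manual_counts[word] += 1
    else:
        manual_counts[word] = 1

# defaultdict(int): a missing key is created with int() == 0 on first access,
# so the increment works without any safety check.
auto_counts = defaultdict(int)
for word in words:
    auto_counts[word] += 1

print(dict(auto_counts))
```

Both loops produce the same tallies; the second simply has no branch to get wrong.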

Speaker 2

08:40

Right. Counter is a dictionary subclass designed specifically for this exact problem. You just hand it a raw list of a million words, and in one single line of code, it absorbs the list and spits out a dictionary mapping each word to its frequency.
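A minimal sketch of the one-line tally, using `collections.Counter` from the standard library. The word list is made-up sample data.

```python
from collections import Counter

# Made-up sample of words pulled from profile bios.
words = ["hadoop", "python", "hadoop", "scikit-learn", "python", "hadoop"]

word_counts = Counter(words)            # the whole tally in one call
top_two = word_counts.most_common(2)    # [(word, count), ...] sorted by count

print(word_counts["hadoop"], top_two)
```

`most_common` is handy when only the top few interests matter, and asking a Counter for a word it never saw returns zero rather than raising an error.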

Speaker 1

08:54

It transforms a multi-step, error-prone loop into a single elegant command. Oh wait, you just casually mentioned a list of a million words. When we're dealing with data at scale, moving a million items around has to cause problems. Like if I try to assign a list of a million numbers to a variable, Python is literally allocating memory for all one million of those numbers instantly, right?

Speaker 2

09:17

Correct. And if your data set is large enough, your machine will completely choke. It runs out of RAM and crashes. This is where the book introduces the concept of lazy evaluation.

Speaker 1

09:27

Using generators.

Speaker 2

09:29

Specifically using generators.

Speaker 1

09:31

I always picture this well, it's like watching a movie. If I use a standard list, it's like downloading a massive four K movie file to my hard drive. I can't watch a single second of it until the entire one hundred gigabyte file is sitting in my memory.

Speaker 2

09:45

That's a perfect analogy.

Speaker 1

09:46

But a generator is like streaming it. I'm just pulling the exact frame I need, exactly when I need it, and then discarding it to make room for the next frame.

Speaker 2

09:53

That is an excellent way to visualize it. A generator yields values one at a time. It pauses its state, waits for you to ask for the next value, and only computes what is strictly necessary in that moment.
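A small sketch of that pause-and-resume behavior, assuming a hand-written generator function (the name `lazy_range` is used here for illustration).

```python
def lazy_range(n):
    """Yield 0, 1, ..., n - 1 one value at a time instead of building a list."""
    i = 0
    while i < n:
        yield i      # hand back one value, then pause here until asked again
        i += 1

stream = lazy_range(1_000_000)   # nothing is computed yet
first = next(stream)             # runs the body up to the first yield
second = next(stream)            # resumes where it paused

print(first, second)
```

Only the values actually requested are ever produced; the other 999,998 numbers never exist in memory unless something asks for them.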

Speaker 1

10:05

Which is incredibly memory efficient. Okay, but if streaming is so much lighter on my system, why am I ever downloading the file? Why wouldn't I use a generator for absolutely everything in my data pipeline?

Speaker 2

10:17

Ah, because a generator is ephemeral. You can only iterate through it once.

Speaker 1

10:21

Wait, really, once I read it, it's just gone.

Speaker 2

10:24

Exactly. If you have a massive data set of user interactions and you need to loop through it to find the average, and then loop through it again to find the standard deviation, a generator will be completely exhausted after that first pass.

Speaker 1

10:37

Oh wow, I didn't realize that.

Speaker 2

10:39

Yeah, you would have to recompute the entire stream from scratch for the second pass. So the engineering challenge is constantly balancing memory efficiency against how many times you actually need to interact with that specific data set.
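The single-pass limitation is easy to demonstrate. In this sketch the readings are made-up numbers, and the standard deviation is the population form (dividing by n), chosen only to keep the arithmetic simple.

```python
def readings():
    """Stand-in for a large stream of values; the numbers are made up."""
    for value in [2, 4, 4, 4, 5, 5, 7, 9]:
        yield value

stream = readings()
first_pass = list(stream)    # consumes every value
second_pass = list(stream)   # the generator is now exhausted

# Two-pass statistics need either a fresh generator per pass or a stored list.
data = list(readings())                       # trade memory for re-usability
mean = sum(data) / len(data)                  # pass one
variance = sum((x - mean) ** 2 for x in data) / len(data)  # pass two
std_dev = variance ** 0.5

print(second_pass, mean, std_dev)
```

Materializing the data as a list costs memory but allows any number of passes; calling `readings()` again costs recomputation instead.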

Speaker 1

10:51

Okay, all of this, the dictionaries, the generators, it's beautiful logic. But all of these elegant Python tools are really just for us, the developers. If I take a defaultdict full of raw frequencies and drop it on the VP of networking's desk, their eyes are going to glaze over. You have to translate those numbers into a visual space.

Speaker 2

11:10

Which brings us to the visual translation layer, matplotlib.

Speaker 1

11:13

The book touches on the standard toolkit, you know, bar charts for buckets of discrete data, line charts for continuous trends, scatterplots for pairing two variables together to see if they correlate. The basics, right, but we don't need to dwell on what a line chart is. What's interesting is that the book spends a very specific amount of time warning the reader about the mechanics of visual deception.

Speaker 2

11:36

Yes, the trap of the misleading axis. When you're translating raw numbers into a visual space, you are essentially creating a narrative, and that narrative is incredibly easy to manipulate.

Speaker 1

11:46

Here's where it gets really interesting. The example Grus uses is brilliant. Let's say we are tracking how many times the phrase "data science" is mentioned on user profiles. The data shows that in twenty thirteen there were five hundred mentions. In twenty fourteen there were five hundred and five mentions.

Speaker 2

12:02

That is an increase of exactly five.

Speaker 1

12:04

Right. So if you plot those two bars on a chart and you start your y-axis at zero, the difference between five hundred and five hundred and five is basically imperceptible.

Speaker 2

12:12

It looks like a flatline. The narrative there is growth has completely stagnated.

Speaker 1

12:17

But what if I want my boss to think I'm doing an amazing job growing the platform? I go into my plotting tool and I manually force the axis to start at four ninety-nine and end at five oh six.

Speaker 2

12:26

And suddenly the twenty fourteen bar is towering over the twenty thirteen bar. It visually implies this massive explosive increase.
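The arithmetic behind the trick can be worked out without drawing anything. This sketch uses a hypothetical helper, `apparent_ratio`, to compare how tall the two bars are drawn relative to each other for a given axis baseline; in a plotting library the baseline corresponds to the lower y-axis limit.

```python
def apparent_ratio(small, large, baseline):
    """How many times taller the larger bar is drawn when the y-axis starts at `baseline`."""
    return (large - baseline) / (small - baseline)

mentions_2013, mentions_2014 = 500, 505

honest = apparent_ratio(mentions_2013, mentions_2014, baseline=0)     # axis starts at 0
cropped = apparent_ratio(mentions_2013, mentions_2014, baseline=499)  # axis starts at 499

print(honest, cropped)
```

Starting at zero, the second bar is drawn about one percent taller than the first. Starting at 499, it is drawn six times taller, even though the underlying numbers have not changed.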

Speaker 1

12:37

It's the visual equivalent of taking a photo of a puddle, cropping it incredibly tight so you can't see the edges, and trying to convince the viewer it's the ocean.

Speaker 2

12:45

That's exactly what it is.

Speaker 1

12:46

If it is that easy to lie with data. How can anyone trust a chart in a corporate presentation.

Speaker 2

12:52

Well, what's fascinating here is this vulnerability is exactly why the premise of the book is so vital. When you use black box visualization libraries, they often auto-scale the axes based on the minimum and maximum values of the data provided. A tool might automatically crop the axis to four ninety-nine without any malicious intent, just to save space on the screen.

Speaker 1

13:13

So the software might accidentally lie to me.

Speaker 2

13:15

Yes, or a human might do it intentionally to sell you a false narrative. But when you learn matplotlib from scratch and you physically write the code to define the axis limits, you see exactly how it works. You internalize the mechanics of how the visual is constructed. You train your brain to instantly look at the axis before you let the shape of the line influence your judgment. You're basically immunizing yourself against bad data science.

Speaker 1

13:41

I love that immunizing yourself, But I want to push deeper into the actual geometry of what we're doing here.

Speaker 2

13:46

Okay, let's do it.

Speaker 1

13:46

When we take two variables, let's say we're plotting a user's number of friends against the minutes they spend on the site every day, and we put a dot on a scatter plot, we are essentially placing a vector in a mathematical space.

Speaker 2

14:00

Correct. And to truly grasp how algorithms find patterns in that space, we have to strip away the Python syntax, strip away the visual charts, and look at the hidden architecture that runs the entire discipline.

Speaker 1

14:12

Which is linear algebra.

Speaker 2

14:15

The dreaded linear algebra.

Speaker 1

14:16

That is the phrase that makes half the room break out in a cold sweat.

Speaker 2

14:19

I know, I know.

Speaker 1

14:20

But abstractly, vectors are just objects that can be added together or multiplied by scalars to form new vectors.

Speaker 2

14:27

And concretely, for our purposes anyway, they are simply points in a finite-dimensional space. Representing user data as vectors is the foundational trick of machine learning. The text gives a very grounding example, a person's physical attributes. You have a list of three numbers: seventy, one seventy, and forty.

Speaker 1

14:46

Meaning seventy inches tall, one hundred and seventy pounds, and forty years old.

Speaker 2

14:50

Exactly. In Python it is just a list. But mathematically it is a single coordinate in three-dimensional space.

Speaker 1

14:57

Okay, so I have this coordinate. But the book forces us to actually do math with these lists from scratch. If I want to add two user vectors together, I can't just put a plus sign between two Python lists.

Speaker 2

15:08

No, it doesn't work like that. Python would just stick the two lists together end to end.

Speaker 1

15:12

So the book introduces the zip function. Walk me through the mechanics of that.

Speaker 2

15:15

Vectors must be added component-wise. The first element of vector A adds to the first element of vector B. So if vector A is [1, 2] and vector B is [2, 1], the zip function acts like a physical zipper. It takes the first element from both lists, the one and the two, and binds them into a pair. Then it takes the second elements, the two and the one, and binds them.

Speaker 1

15:37

And once they're paired up, the book uses a list comprehension to iterate through those pairs, add them together, and spit out a new vector, [3, 3].
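The zip-then-add mechanics just described can be sketched like this. The function name `vector_add` follows the conversation; treat the exact signature as an assumption rather than the book's verbatim code.

```python
def vector_add(v, w):
    """Add two equal-length vectors component by component."""
    return [v_i + w_i for v_i, w_i in zip(v, w)]

a = [1, 2]
b = [2, 1]

concatenated = a + b        # plain `+` on lists just joins them end to end
pairs = list(zip(a, b))     # zip pairs up matching positions
summed = vector_add(a, b)   # component-wise sum

print(concatenated, pairs, summed)
```

Note that `zip` stops at the shorter input, so a real implementation may want to check that both vectors have the same length first.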

Speaker 2

15:46

From there you build a function for the dot product, which is just multiplying those matching pairs and summing up the total.
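A minimal sketch of that multiply-and-sum step, assuming vectors are plain lists of numbers (the function name `dot` is chosen for illustration).

```python
def dot(v, w):
    """Sum of the products of matching components: v_1 * w_1 + ... + v_n * w_n."""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

a = [1, 2]
b = [2, 1]

a_dot_b = dot(a, b)   # 1*2 + 2*1
a_dot_a = dot(a, a)   # 1*1 + 2*2, the squared length of a

print(a_dot_b, a_dot_a)
```

Taking the dot product of a vector with itself gives the sum of its squared components, which is why the dot product is the building block for measuring length.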

Speaker 1

15:52

Let me stop you right there, because I understand the mechanics of what you just described, But why are we doing it? Why does an algorithm care about the sum of multiplied components.

Speaker 2

16:03

Because the dot product of a vector with itself gives you its squared magnitude, and more importantly, the dot product allows you to calculate the angle between two vectors.

Speaker 1

16:10

And why does the angle matter?

Speaker 2

16:12

This is how algorithms determine similarity. If you plot the interests of two users as massive vectors, and you calculate the cosine of the angle between them using the dot product, the math tells you how similar those two people are. A small angle means their vectors are pointing in nearly the same direction. They like the same things. That is literally how OkCupid knows who you should date.
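A hedged sketch of the similarity calculation: the cosine of the angle between two vectors is their dot product divided by the product of their magnitudes. The three interest vectors are hypothetical, with a 1 meaning the user lists that interest; this is not a description of any real dating site's algorithm.

```python
import math

def dot(v, w):
    """Sum of the products of matching components."""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def magnitude(v):
    """Length of a vector: the square root of its dot product with itself."""
    return math.sqrt(dot(v, v))

def cosine_similarity(v, w):
    """Cosine of the angle between v and w (assumes neither is all zeros)."""
    return dot(v, w) / (magnitude(v) * magnitude(w))

# Hypothetical interest vectors over four topics.
user_a = [1, 1, 0, 1]
user_b = [1, 1, 0, 1]
user_c = [0, 0, 1, 0]

same_taste = cosine_similarity(user_a, user_b)   # close to 1: same direction
no_overlap = cosine_similarity(user_a, user_c)   # 0: nothing in common

print(same_taste, no_overlap)
```

A value near 1 means the two vectors point the same way; a value of 0 means they share no interests at all.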

Speaker 1

16:34

That is fascinating. But I have to be completely honest here. I'm looking at the sheer amount of code it takes to build a vector add function from scratch using zip and list comprehensions. Highly optimized libraries like NumPy can execute vector addition across millions of data points in a fraction of a millisecond.

Speaker 2

16:54

Well, absolutely.

Speaker 1

16:55

So Grus forcing us to manually zip lists together feels a bit like hazing. Why are we doing this?

Speaker 2

17:01

It's a fair question, but think of it this way. If you don't understand basic arithmetic, a pocket calculator seems like a magic box. You punch in some numbers and it spits out the truth, right. But if you accidentally hit the division key instead of the multiplication key, the calculator will spit out a completely absurd answer.

Speaker 1

17:17

And because I don't intuitively understand the math, I wouldn't even realize the answer is absurd. I'd just trust the screen.

Speaker 2

17:25

Exactly. Relying on NumPy without understanding linear algebra is the exact same danger, but on a massive scale. When everything works, you're fine. But what happens when you encounter the curse of dimensionality? The what? The curse of dimensionality. What happens when you're dealing with data in one hundred dimensions and the distances between all your points mathematically approach uniformity, breaking your predictive model?
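The effect can be seen with a small experiment: draw random pairs of points in a unit cube and compare the closest pair to the average pair. This is a sketch under stated assumptions (uniform random points, 1,000 pairs, a fixed seed, hypothetical function names), not a reproduction of any specific listing.

```python
import math
import random

def random_point(dim, rng):
    """A point with `dim` coordinates drawn uniformly from [0, 1)."""
    return [rng.random() for _ in range(dim)]

def distance(v, w):
    """Euclidean distance between two points."""
    return math.sqrt(sum((v_i - w_i) ** 2 for v_i, w_i in zip(v, w)))

def min_to_mean_ratio(dim, num_pairs, rng):
    """Closest-pair distance divided by average distance over random pairs."""
    distances = [distance(random_point(dim, rng), random_point(dim, rng))
                 for _ in range(num_pairs)]
    return min(distances) / (sum(distances) / len(distances))

rng = random.Random(0)  # fixed seed so repeated runs agree
ratio_2d = min_to_mean_ratio(2, 1000, rng)
ratio_100d = min_to_mean_ratio(100, 1000, rng)

print(ratio_2d, ratio_100d)
```

In two dimensions the closest pair is a small fraction of the average distance. In one hundred dimensions the ratio climbs much closer to 1, meaning "near" and "far" points become hard to tell apart, which is what undermines distance-based models.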

Speaker 1

17:51

The black box isn't going to tell me why it broke.

Speaker 2

17:53

No, it won't. And because you don't understand the arithmetic of the vector space, you won't know how to fix it. You won't realize that your data is behaving absurdly. Building it from scratch builds your mathematical intuition. It ensures you remain the mechanic, not just a passenger.

Speaker 1

18:08

That perfectly encapsulates this entire journey. I mean, we started today looking at a raw, confusing data dump from DataSciencester. We realized that default metrics like degree centrality can completely blind us to the actual structural reality of a network.

Speaker 2

18:22

We tamed that raw data using the strict elegance of Python, learning the mechanical advantage of tools like defaultdict to skip boilerplate logic, and lazy generators to keep our systems from crashing under the weight of massive lists.

Speaker 1

18:35

We took those insights and visualized them, digging into the mechanics of the axis so we could actively defend ourselves against manipulated narratives. Yes, and finally, we translated our users into mathematical vectors, exploring the literal architecture of similarity that powers every recommendation engine on Earth.

Speaker 2

18:53

Knowing these first principles is your shield. Whether you are analyzing your own company's user behavior, preparing for a high-level strategy meeting, or just trying to navigate a world driven by algorithms, understanding the how protects you against the hype.

Speaker 1

19:08

I want to leave you with a final thought to chew on something that stretches beyond the scope of Python two point seven. Okay, we've spent this entire deep dive manually building these models so we can truly comprehend the how and the why. When we talk about a three dimensional vector of height, weight, and age, our human brains can picture it. We can imagine that dot floating in a room.

Speaker 2

19:30

But the bleeding edge of this field isn't operating in three dimensions.

Speaker 1

19:34

Exactly. Modern neural networks and deep learning systems are analyzing patterns across millions of dimensions simultaneously.

Speaker 2

19:40

Yeah, it's unfathomable.

Speaker 1

19:41

They are finding mathematical correlations in vector spaces that no human mind can actually visualize or comprehend. So the provocative question is this: as these self-learning algorithms become vastly more complex, will we eventually reach a threshold where even the mechanics, the people who build systems from absolute scratch, can no longer truly understand the why behind the insights the machine produces?

Speaker 2

20:07

We're moving from an era of tools we fully comprehend to an era of intelligences we merely point in a direction.

Speaker 1

20:14

We might end up right back where we started on day one, staring at an unimaginably massive humming engine block and having absolutely no idea why it's moving. Thank you for joining us on this deep dive into the foundations of data science.
