Deep Learning from Scratch: Building with Python from First Principles

Speaker 1

00:00

So if you ask an artificial intelligence to write a Shakespearean sonnet about, I don't know, a toaster.

Speaker 2

00:06

Right, it just does it in like three seconds.

Speaker 1

00:08

Flat, exactly. It's so incredibly fast and honestly so convincingly human that it's really easy to just throw our hands up and say, well, the computer is just thinking.

Speaker 2

00:18

Yeah, that's the illusion.

Speaker 1

00:19

But here is the secret that the tech world sort of, you know, glides right past. The AI doesn't actually know what a toaster is.

Speaker 2

00:29

No, not at all.

Speaker 1

00:30

It doesn't know what a poem is. It's not experiencing this burst of creative genius underneath the hood. It is really just doing a massive amount of incredibly fast, very boring accounting.

Speaker 2

00:42

Which is exactly what we're going to get into.

Speaker 1

00:44

Right. So today, for you, our listener, we're opening that ledger. We are taking you on a custom-tailored deep dive to totally demystify how artificial intelligence actually, you know, learns.

Speaker 2

00:55

Yeah, no magic, no impenetrable labyrinths, just the raw mechanics.

Speaker 1

00:59

The mechanics.

Speaker 2

01:00

Because, I mean, we appreciate the result of a neural network, right, but we rarely understand the underlying chemistry of how it actually got there.

Speaker 1

01:07

It's totally a black box for most people.

Speaker 2

01:10

Exactly. So our guide for pulling back the curtain today is this fantastic book by Seth Weidman. It's called Deep Learning from Scratch: Building with Python from First Principles. A great resource, it really is. And what we're going to do is take all that intimidating jargon, you know, the algorithms, the calculus, the hyperparameters, the scary stuff, all of it, and we're going to strip it all the way down to the foundational floorboards.

Speaker 1

01:33

So we're going to use simple math, visual diagrams, and some basic code to show you that deep learning is really just a highly scaled assembly line of very very simple mathematical factories.

Speaker 2

01:44

That's a great way to put it.

Speaker 1

01:46

But before we actually get to building that assembly line, we need to talk about why learning this stuff in the first place is normally such a complete nightmare. Oh, it really is. Because if you try to read a standard academic paper on neural networks, it often feels like you're trying to read ancient Greek. Why is the entry point so brutal?

Speaker 2

02:04

Well, Weidman tackles this right out of the gate. He uses that old parable of the blind men and the elephant.

Speaker 1

02:10

Oh right, sure.

Speaker 2

02:11

So you have a group of blind men who encounter an elephant for the first time. One touches the trunk and says, oh, an elephant is like a thick snake. Another touches the ear and says, no, it's a fan. Another grabs the leg and says it's a tree trunk.

Speaker 1

02:25

And they're all kind of right, but also completely wrong.

Speaker 2

02:30

Exactly. They are all describing a correct, isolated part, but none of them are describing the whole animal. And deep learning resources have historically done the exact same thing.

Speaker 1

02:41

Okay, that makes a lot of sense, because if you want to learn a standard computer science concept, say how a search algorithm works, the resources out there are usually holistic. Like, a good textbook gives you a plain English explanation, then a whiteboard diagram, then the math, and finally the pseudocode so you can actually build it. You get the whole elephant, right? You get the whole elephant. But AI resources fracture this, don't they?

Speaker 2

03:05

Completely. The field sort of fractured into two really extreme camps. On one side, you have these highly conceptual, incredibly dense math textbooks.

Speaker 1

03:15

The ancient Greek.

Speaker 2

03:18

Exactly. Weidman points to Ian Goodfellow's famous Deep Learning book, and look, it's an absolute masterpiece, sure, but if you aren't already fluent in advanced calculus and linear algebra, you're going to hit a brick wall on, like, page ten. It's just a sea of abstract equations.

Speaker 1

03:34

So what's the other extreme then, Because if I don't want to drown in calculus, where do people usually go?

Speaker 2

03:40

Well, they go to the highly practical, code-heavy tutorials. So you might look up the documentation for a modern library like PyTorch, and you just copy a block of Python code, you paste it, you run it, and you watch this number on your screen called the loss value start to go down.

Speaker 1

03:57

Which means it's working, right?

Speaker 2

03:59

Technically, yes, the network is learning, but the tutorial never actually stops to explain the why. It's like you're driving a sports car but you have zero clue how the internal combustion engine works.

Speaker 1

04:09

Which is, I'm guessing, where Weidman's approach comes in. He argues you have to merge these perspectives.

Speaker 2

04:15

Yes, exactly. His core thesis is that to truly understand neural networks, you have to hold multiple mental models in your head simultaneously.

Speaker 1

04:23

Okay, what does that look like?

Speaker 2

04:25

Well, you have to look at a neural network and see it as a mathematical function, but at the exact same time, you have to see it as a computational graph where data physically flows from left to right. Got it. You also have to see it as a series of layered neurons, and finally, you have to understand it conceptually as a universal function approximator.

Speaker 1

04:44

Wait, hold on, a universal function approximator? That sounds like a fancy blender from a late-night infomercial or something. What does that actually mean?

Speaker 2

04:55

I know it sounds super intimidating, but it just means a machine that can mold itself to approximate literally any pattern in the universe, provided it has enough parts. Any pattern, pretty much, whether the pattern is predicting tomorrow's weather, or recognizing a cat in a photo, or translating English to French. If there's a logical relationship between the input

05:16

and the output, a neural network can approximate it. That's wild. It is. But you only realize how it does that if you force yourself to see the math, the diagram, and the code side by side.

Speaker 1

05:27

And I guess that's why Weidman forces the reader to build these networks from scratch in Python, using just, like, basic arrays.

Speaker 2

05:34

In NumPy, exactly.

Speaker 1

05:35

It's not because you're trying to build the fastest AI in the world. It's purely an exercise in solidifying your understanding of those models.

Speaker 2

05:42

Spot on.

Speaker 1

05:42

So let's start doing exactly that for the listener. Let's abandon the complex terminology. We need to start at the absolute foundation of all machine learning: the mathematical function and the derivative.

Speaker 2

05:55

Right. And usually when we learn about functions in high school, we use the Cartesian plane, you know, René Descartes' classic X and Y axes.

Speaker 1

06:02

The good old graph paper.

Speaker 2

06:04

Exactly, you plot some points, you draw a curved line through them, and that's fine for basic geometry, but it's actually a terrible mental model for deep learning.

Speaker 1

06:13

Yeah, drawing parabolas isn't going to help us build an AI. Instead, Weidman tells us to visualize a function as a mini factory, just a physical box sitting on a table. Inputs go into the box on a conveyor belt. The factory has some strict internal rules that it applies to whatever comes in, and then a transformed output comes out the other side.

Speaker 2

06:35

Precisely. So let's say the factory is a square function. You send the number two into the factory. The factory's internal rule is to multiply the input by itself, so out comes the number four. You send in a three, out comes a nine. It's just a simple, predictable machine.
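
The square "factory" described here can be sketched in a few lines of Python. This is only an illustration; the function name `square` is an assumption made for this example, not code quoted from the book.

```python
def square(x: float) -> float:
    """The factory's internal rule: multiply the input by itself."""
    return x * x


# Send inputs down the conveyor belt and look at what comes out.
print(square(2))  # 4
print(square(3))  # 9
```

The same input always produces the same output, which is what makes the box a function.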

Speaker 1

06:49

Okay. So if the function is just a factory box, what is a derivative? Because just hearing the word derivative definitely triggers some traumatic math flashbacks for me.

Speaker 2

06:58

Oh, for sure. But let's stick with our factory visualization. Imagine there is a physical string connecting the input of the factory to the output of the factory.

Speaker 1

07:07

A string. Okay.

Speaker 2

07:07

The derivative is simply asking a very practical question. If you pull on the input string by a very very small amount, a tiny nudge like point zero zero zero one, by what multiple does the output string move?

Speaker 1

07:20

Ah, okay. So it's kind of like adjusting an analog volume knob on an old stereo. If I nudge the input dial just a tiny fraction of a millimeter, how much louder does the music actually get? Like, does a tiny nudge on the input cause a massive, blown-speaker spike in the output, or does it barely move the needle at all?

Speaker 2

07:41

Exactly. You're measuring the rate of change.
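
The "pull the string by a tiny amount" idea can be approximated numerically. The sketch below assumes a nudge of 0.0001, as mentioned in the conversation, and uses a central difference (nudging in both directions), which is one common way to estimate a derivative. The names `nudge_derivative` and `delta` are hypothetical.

```python
def square(x):
    """The square factory from earlier: multiply the input by itself."""
    return x * x


def nudge_derivative(func, x, delta=0.0001):
    """Estimate how much the output moves per unit of input nudge.

    Nudges the input up and down by delta and divides the change in
    output by the total change in input (a central difference).
    """
    return (func(x + delta) - func(x - delta)) / (2 * delta)


slope = nudge_derivative(square, 3.0)
print(slope)  # roughly 6: near x = 3, the output moves about 6 times as far as the input
```

For the square factory the exact answer is 2x, so at an input of three the multiplier is six.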

Speaker 1

07:43

But okay, why is this tiny nudge so crucial? Like why does an artificial intelligence care so much about this little string?

Speaker 2

07:51

Because this rate of change, knowing exactly how the input affects the output, is the literal engine of machine learning. It is how the model knows how to correct its own errors. Think about it. If an AI makes a prediction and that prediction is wrong, it needs to know how to fix it.

Speaker 1

08:10

Right. It has to adjust.

Speaker 2

08:11

And if the AI knows exactly how a tiny nudge to its internal settings will affect the final outcome, it knows exactly which dials to turn and in which direction to get a better result next time. The derivative is basically the compass pointing toward the correct answer.

Speaker 1

08:24

I see. Okay, so we have a single mini factory. You nudge the input, you watch the output change, you adjust the dial. That makes sense, but predicting a housing price or writing a poem takes way more than one mathematical step. Real data doesn't just go through one simple rule. So how do these boxes actually talk to each other without losing all the data?

Speaker 2

08:43

So this brings us to the concept of nested functions. In deep learning, you almost never have just one factory.

Speaker 1

08:50

You have a chain of them, an assembly line.

Speaker 2

08:52

Exactly, an assembly line. The output conveyor belt of factory one feeds directly into the input conveyor belt of factory two. Factory one transforms the raw data, passes it to factory two, which transforms it again, and so on.

Speaker 1

09:05

Okay, but if I nudge the input at the very beginning of the assembly line, that ripple has to travel through every single factory to reach the end. How do we track that string across ten different boxes?

Speaker 2

09:16

We use what might be the single most important mathematical rule in all of deep learning: the chain rule from calculus. And again, Weidman demystifies this beautifully using the factory boxes.

Speaker 1

09:28

Okay, let's trace the string, then. Let's say we have two boxes. We pull the string on the input to box one. We observe that its output changes by a factor of three, so a three-times multiplier. That output is now the input for box two. And we already know that if we tweak the input of box two, its output changes by a factor of, say, negative two.

Speaker 2

09:47

Perfect setup. So to find the total change across the entire chain, from the very first input to the very last output, the chain rule says we simply multiply those rates of change together. Just multiply them. Box one changes things by a factor of three, box two changes things by a factor of negative two. The total change across the whole chain is three multiplied by negative two, which equals negative six.

Speaker 1

10:10

Oh wow, so a one-unit nudge at the start creates a negative six unit shift at the very end.

Speaker 2

10:14

Exactly.
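
The three-times-negative-two example can be checked in code. The sketch below assumes the simplest boxes that have those rates of change, namely "multiply by 3" and "multiply by -2"; the conversation does not say what the boxes actually compute, so `box_one`, `box_two`, and `slope` are hypothetical names chosen for illustration.

```python
def box_one(x):
    return 3.0 * x    # assumed rule: output changes 3 times as fast as input


def box_two(u):
    return -2.0 * u   # assumed rule: output changes -2 times as fast as input


def slope(func, x, delta=0.0001):
    """Numerically estimate the rate of change of func at x."""
    return (func(x + delta) - func(x - delta)) / (2 * delta)


x = 1.0
slope_one = slope(box_one, x)              # string across box one
slope_two = slope(box_two, box_one(x))     # string across box two, at box two's actual input
chain_rule_total = slope_one * slope_two   # chain rule: multiply the two rates

# Cross-check by nudging the input of the whole assembly line directly.
direct_total = slope(lambda v: box_two(box_one(v)), x)

print(chain_rule_total, direct_total)  # both come out at approximately -6
```

Note that box two's rate of change is measured at the value box one actually hands it. For straight-line rules like these that detail makes no difference, but for curved rules it does.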

Speaker 1

10:15

But wait, practically speaking, if I'm actually coding this assembly line, how does the system know those numbers? Like, do I have to run the data all the way forward to get an answer, and then somehow trace my steps all the way backward to figure out the chain rule math?

Speaker 2

10:26

That is exactly what you have to do. To code this from scratch, your system has to make two distinct passes. First is the forward pass. You feed your initial data into the first factory and you let it run all the way down the assembly line. But here's the catch. As the data moves forward, the system has to save all the intermediate quantities at every single step. It has to keep a meticulous record of what happened inside each box.

Speaker 1

10:54

Why does it need to save all that? If it reaches the end and gets an answer, hasn't it done its job?

Speaker 2

10:58

Because of the second step, the backward pass. Once the data reaches the end and the network spits out a prediction, you compare that prediction to the correct answer to see how wrong you were. Then you run backward down the assembly line. You use all those intermediate records you saved during the forward pass to calculate the derivatives, the strings, going backward. You calculate box two's string, then multiply it by box one's string using the chain rule, all the way back to the start.
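
A minimal sketch of the two passes is shown below, under stated assumptions: box one squares its input, box two cubes its input, and "how wrong you were" is measured with a squared error. None of these specific choices come from the episode. The names `forward`, `backward`, and `cache` are hypothetical.

```python
def forward(x):
    """Forward pass: run the assembly line and save every intermediate value."""
    cache = {"x": x}
    u = x * x                 # box one: square
    cache["u"] = u
    prediction = u ** 3       # box two: cube
    cache["prediction"] = prediction
    return prediction, cache


def backward(cache, target):
    """Backward pass: multiply the local rates of change, last box first."""
    d_loss_d_pred = 2.0 * (cache["prediction"] - target)  # from loss = (prediction - target) ** 2
    d_pred_d_u = 3.0 * cache["u"] ** 2                    # box two's string needs the saved u
    d_u_d_x = 2.0 * cache["x"]                            # box one's string needs the saved x
    return d_loss_d_pred * d_pred_d_u * d_u_d_x


prediction, cache = forward(2.0)
grad = backward(cache, target=60.0)
print(prediction)  # 64.0
print(grad)        # 1536.0, i.e. 8 * 48 * 4
```

Each derivative in `backward` reads a value stored during `forward`, which is why the forward pass cannot simply throw its intermediate results away.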

Speaker 1

11:24

I'm not going to lie. That sounds incredibly tedious to code by hand, keeping track of every single variable, saving it all in memory, running backward, multiplying the strings. It sounds like an absolute nightmare of bookkeeping.

Speaker 2

11:37

It is. It's a massive bookkeeping operation. And this is exactly why modern deep learning libraries like PyTorch are so popular today.

Speaker 1

11:44

Because they do it for you.

Speaker 2

11:47

Exactly. They use something called automatic differentiation. They handle all that tedious forward and backward bookkeeping completely invisibly. You just define the factories and the library does the calculus for you.

Speaker 1

11:58

But Weidman forces you to do it by hand anyway, right? Because if you just rely on PyTorch, you're back to being a blind man touching the elephant. You don't see the whole process.

Speaker 2

12:08

Exactly. By coding the forward and backward passes from scratch in Python, you actually see the mechanics. You realize that learning isn't consciousness. It's literally just a series of multipliers passed backward down an assembly line.

Speaker 1

12:19

Okay, I'm with you on the strings and the assembly line. But single numbers are great for theory. Reality is messy.

Speaker 2

12:26

Very messy.

Speaker 1

12:27

If I want an AI to predict a housing price, I'm not just feeding it a single number. A house has dozens of features, square footage, number of bedrooms, age of the roof, proximity to a highway. So how do we pull a string on a massive spreadsheet of information?

Speaker 2

12:45

This is where we scale up to matrices and supervised learning. Supervised learning is just finding relationships between characteristics that have already been measured. And to process all those characteristics, we can't use single numbers. We have to stack the data into grids, which in NumPy are called ndarrays, or n-dimensional arrays.

Speaker 1

13:03

Right. So, if you visualize a spreadsheet, the columns are the features, like bedrooms and square footage, and every specific house you are evaluating becomes a row.

Speaker 2

13:12

Yep.

Speaker 1

13:13

So a two-by-two grid might be two houses, each with two features.
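
As a small illustration of that grid, here is a two-by-two NumPy `ndarray`. The feature values and the variable name `houses` are made up for this example.

```python
import numpy as np

# Each row is one house; each column is one feature.
# Columns (assumed for illustration): bedrooms, square footage.
houses = np.array([
    [3.0, 1500.0],   # house 1
    [4.0, 2200.0],   # house 2
])

print(type(houses).__name__)  # ndarray
print(houses.shape)           # (2, 2): two houses, two features
print(houses[0])              # the first house's row of features
print(houses[:, 1])           # the square-footage column for every house
```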

Speaker 2

13:16

Now, when this grid of data enters the first factory, the model needs a way to evaluate it. It performs what's called a weighted sum.

Speaker 1

13:23

A weighted sum.

Speaker 2

13:24

Right. It looks at the features and decides how important each one is. Does the square footage matter more than the age of the roof? It assigns a mathematical weight to each feature.

Speaker 1

13:33

Okay, let me guess how this works mathematically. If I have a column for bedrooms and a weight that says bedrooms are very important, is the factory just doing a dot product, like, matching them up?

Speaker 2

13:46

Yes. Think of a dot product as a matching game. The factory lines up the house's features in one hand and its internal priority weights in the other.

Speaker 1

13:55

Okay.

Speaker 2

13:56

It matches the bedrooms to the bedroom weight, multiplies them together, matches the square footage to the square footage weight, multiplies them. Then it throws all those paired results into one single bucket and adds them up.
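
The matching game can be written out step by step and compared with NumPy's built-in dot product. The feature values and weights below are invented for illustration.

```python
import numpy as np

features = np.array([3.0, 1500.0])   # one house: bedrooms, square footage (assumed values)
weights = np.array([10.0, 0.5])      # one priority weight per feature (assumed values)

# Match each feature with its weight and multiply the pairs.
paired = features * weights          # [30.0, 750.0]

# Throw the paired results into one bucket and add them up.
weighted_sum = paired.sum()

print(weighted_sum)                  # 780.0
print(np.dot(features, weights))     # the same number via the dot product
```

`np.dot` performs the multiply-and-add in one call, which is why the weighted sum and the dot product are the same operation.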

Speaker 1

14:08

That's the sum, right. But if you keep multiplying features by weights, that bucket is going to overflow real fast. I mean, a three thousand square foot house multiplied by a heavy weight becomes a massive number. Do we just let the numbers get infinitely large?

Speaker 2

14:21

We can't, which is why we usually feed that bucket into another factory right afterward, typically something called a sigmoid function.

Speaker 1

14:28

A sigmoid function, we haven't covered that one. What's that?

Speaker 2

14:30

A sigmoid function is basically a squishing factory.

Speaker 1

14:32

A squishing factory.

Speaker 2

14:33

Yeah, it takes whatever wild massive number comes out of the weighted sum, and it brutally compresses it into a manageable decimal between zero and one.

Speaker 1

14:42

Oh, I see.

Speaker 2

14:42

This is incredibly useful if you just want the network to give you a probability, like a point eight chance that the house is a good buy, rather than outputting a raw score of four million.

Speaker 1

14:51

Okay, so our assembly line is now take the matrix of houses, match them with weights, sum them up, and then squish them through a sigmoid factory to get a probability.

Speaker 2

15:01

You got it.
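
The assembly line summarized here (matrix of houses, weights, sum, sigmoid) can be sketched as a short forward pass. The numbers are illustrative, and the features are assumed to be pre-scaled to small values so the sigmoid does not immediately saturate near zero or one.

```python
import numpy as np


def sigmoid(x):
    """The 'squishing factory': compress any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


# Two houses (rows) with two scaled features (columns); values are made up.
houses = np.array([
    [3.0, 1.5],
    [4.0, 2.2],
])
weights = np.array([
    [0.4],
    [-0.2],
])

weighted_sums = houses @ weights          # one weighted sum per house, shape (2, 1)
probabilities = sigmoid(weighted_sums)    # each squished to between 0 and 1

print(weighted_sums.ravel())
print(probabilities.ravel())
```

The `@` operator runs the matching game for every row at once, so the code stays the same no matter how many houses are in the matrix.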

Speaker 1

15:02

I get the forward pass. But here's where my brain completely breaks. To do the backward pass, we have to pull the string to correct the errors. How on earth do you track the derivative of a giant grid of interacting numbers? Every row and column is interacting with every weight. The calculus must just explode into absolute chaos.

Speaker 2

15:23

You would think so. Tracking every single string individually across a massive matrix would be impossible, right? But while the math looks incredibly messy on a whiteboard, the resulting code is brilliantly, shockingly clean. It's a magical property of linear algebra.

Speaker 1

15:39

Whoa, whoa, stop right there. Time out. You literally cannot start this deep dive by promising no magic and then tell me the math relies on a magical property. That is totally cheating. Explain it. Why does the matrix math clean up so nicely?

Speaker 2

15:54

Fair catch. Okay, you're right, no magic. It comes down to something called matrix transposition.

Speaker 1

15:58

Matrix transposition.

Speaker 2

16:00

Yes. When you need to compute the backward pass, the gradient for a giant grid of weights, the chain rule dictates that you don't actually have to calculate a million individual strings. Instead, you take the input matrix and you simply transpose it. You flip it on its side.

Speaker 1

16:16

Meaning the rows become columns and the columns become rows.

Speaker 2

16:19

Exactly. And why does this work mechanically? Think of the forward pass like a river flowing downstream, splitting into hundreds of tiny branches. Those are your data points interacting with weights.

Speaker 1

16:29

Okay, I picture it.

Speaker 2

16:30

If you want to send an error signal back up the river to the exact source that caused it, you just reverse the map. Flipping the matrix on its side perfectly reroutes the error signals backward along the exact same mathematical paths the data used to travel forward.

Speaker 1

16:43

Oh wow. So you aren't doing entirely new, chaotic math to go backward. You're just taking the infrastructure you built going forward, turning it sideways, and letting the error flow back to the correct weight.

Speaker 2

16:54

Precisely. Because of how matrix transposes work out mathematically, this incredibly complex web of interacting data collapses into a few incredibly simple lines of Python code during the backward pass. It scales perfectly.
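
To make the transpose claim concrete, the sketch below computes the gradient of a toy quantity with respect to the weights using the transposed input matrix, then checks one entry against a brute-force nudge. It assumes the final number is simply the sum of the sigmoid outputs, a choice made only to keep the example short. All names and values are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def forward(X, W):
    """Forward pass: weighted sums, then sigmoid, then add everything up."""
    S = sigmoid(X @ W)
    return S.sum(), S


def weight_gradient(X, W):
    """Backward pass: how the final sum changes with respect to each weight."""
    _, S = forward(X, W)
    d_total_d_S = np.ones_like(S)     # each sigmoid output feeds the sum with rate 1
    d_S_d_N = S * (1.0 - S)           # derivative of sigmoid at each weighted sum
    d_total_d_N = d_total_d_S * d_S_d_N
    return X.T @ d_total_d_N          # the transpose routes each signal to its weight


X = np.array([[3.0, 1.5], [4.0, 2.2]])   # two houses, two scaled features (made up)
W = np.array([[0.4], [-0.2]])            # one weight per feature (made up)

grad = weight_gradient(X, W)

# Brute-force check: nudge the first weight and watch the final sum move.
delta = 1e-6
W_nudged = W.copy()
W_nudged[0, 0] += delta
numeric = (forward(X, W_nudged)[0] - forward(X, W)[0]) / delta

print(grad[0, 0], numeric)  # the two estimates should agree closely
```

The gradient has the same shape as `W`, one entry per weight, and the whole backward step for the weights is the single line `X.T @ d_total_d_N`.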

Speaker 1

17:08

That is wild. So it doesn't matter if I'm feeding the factory a single two-by-two grid or a massive matrix with a million rows representing every house in the country. The logic of the assembly line stays exactly the same.

Speaker 2

17:21

Exactly the same.

Speaker 1

17:22

The forward pass runs the matching game and squishes the numbers. The backward pass flips the map on its side to route the blame, runs the chain rule, and updates the weights.
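
Putting those steps in a loop gives a tiny training procedure. The sketch below is an assumption-laden illustration, not the book's code: it uses four made-up houses, a mean squared error, a learning rate of 0.1, and plain gradient descent as the update rule.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


# Four houses with two scaled features each, and a made-up 0/1 label per house.
X = np.array([[3.0, 1.5], [4.0, 2.2], [2.0, 0.9], [5.0, 3.0]])
y = np.array([[0.0], [1.0], [0.0], [1.0]])

W = np.zeros((2, 1))      # start with all weights at zero
learning_rate = 0.1       # how far to turn the dials each step (assumed value)
losses = []

for step in range(200):
    # Forward pass: matching game, then squish.
    N = X @ W
    P = sigmoid(N)
    loss = np.mean((P - y) ** 2)
    losses.append(loss)

    # Backward pass: chain rule, with the transpose routing the blame.
    d_loss_d_P = 2.0 * (P - y) / len(y)
    d_P_d_N = P * (1.0 - P)
    d_loss_d_W = X.T @ (d_loss_d_P * d_P_d_N)

    # Update: nudge every weight against its gradient.
    W -= learning_rate * d_loss_d_W

print(losses[0], losses[-1])  # the loss at the end should be lower than at the start
```

Watching `losses` shrink is the "number going down" from the code-heavy tutorials, except that here every step producing it is visible.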

Speaker 2

17:31

And that is why we can have AI models today with billions or even trillions of parameters. The fundamental architecture, the mini factory, the chain rule, the matrix transposes, is infinitely scalable. You just need more powerful computers to run the assembly line faster.

Speaker 1

17:46

Okay, let's bring this all together for you, the listener. We started this deep dive staring at a hidden circuitry that everyone just assumes is impenetrable, but by looking through the lens of Seth Weidman's work, we've stripped it down. We have deep learning as an assembly line of mini factories. We have inputs that flow forward through nested functions, matching features to weights. We save our math as we go.

18:10

Then we compare our final answer, and we pull the strings backward, flipping the matrices to calculate exactly how to adjust our internal dials. It's not a brain, it's just very fast, very elegant bookkeeping. You now have the foundational mental model for how machines actually learn.

Speaker 2

18:29

It's a very empowering realization, honestly, to finally see the gears turning. But this actually raises a really important question, and it's the thought I want to leave you with today. It goes all the way back to the very first step of this entire process supervised learning.

Speaker 1

18:42

You mean, like setting up the grid of numbers in the first place.

Speaker 2

18:44

Right. Weidman points out that to use these beautiful mathematical assembly lines, we have to translate the messy, ambiguous real world into precise numbers.

Speaker 1

18:54

Yeah.

Speaker 2

18:54

In our example, we chose price to perfectly represent a house's value. The market decides the price, so mathematically that works. But what happens when we try to force incredibly complex human concepts into a single numeric matrix just to make the math run?

Speaker 1

19:10

So wait, if we're building a hiring algorithm, we have to somehow turn a concept like desirability or work ethic into a column on a spreadsheet?

Speaker 2

19:20

Exactly. Or if we're building a loan approval model, we have to quantify something like reliability. We have to convert nuanced, deeply human ideas into cold, hard numbers so our mini factories can actually process them.

Speaker 1

19:31

So if we use something like, say, zip codes to help predict loan defaults, the math might work perfectly on the assembly line, but we've accidentally built a machine that mathematically justifies redlining. The bias isn't in the AI's brain. The bias is baked into the columns of the spreadsheet before the forward pass even starts.

Speaker 2

19:47

Precisely, if the factory only knows what we put on the conveyor belt, then the very first step of deep learning, choosing which numbers represent reality, might actually be its biggest vulnerability.

Speaker 1

19:59

That's terrifying, actually.

Speaker 2

20:00

It's a huge issue. Are our models objectively learning the truth about the world, or are they just efficiently, mathematically learning the biases we hardcoded into the system the moment we decided what to measure?

Speaker 1

20:13

That completely flips the script. We spent this whole time demystifying the machinery inside the factory. We learned how the boxes work, how the chain rule connects them. But maybe the real question we should be asking isn't how the factory processes the materials.

Speaker 2

20:27

It's who is deciding what materials are allowed on the conveyor belt in the first place.

Speaker 1

20:31

Exactly. Something for you to ponder until next time.

Transcript source: Provided by creator in RSS feed