Okay, let's unpack this today. We're embarking on a deep dive into the fascinating, sometimes, maybe often, hyped, but fundamentally important world of data science.
Definitely hyped at times, right?
But our mission here is really to demystify what data science truly is. We want to explore its core processes, the essential tools, and also confront some of the, well, critical real-world challenges that come with working with data.
And the ethical ones too. They're huge.
Absolutely. We'll be drawing our insights primarily from Rachel Schutt's pioneering book, Doing Data Science: Straight Talk from the Frontline, which came out of her course at Columbia University.
A really groundbreaking course back then.
Exactly. So by the end of this deep dive you should have a much clearer understanding of this field, hopefully equipped with the knowledge to, you know, cut through the noise and see why it's relevant. So let's dive straight into the heart of it. What is data science? It's a question that even the pioneers of the field really grappled with.
They really did.
Schutt's Introduction to Data Science course at Columbia, I think it started in fall 2012, really acted as an incubator for this whole idea.
Yeah, it was a starting point, and Cathy O'Neil, from the mathbabe.org blog, was instrumental in bringing these ideas out, specifically pushing back against all the marketing hype.
There was a lot of hype back then.
Oh yeah. A crucial point here is that initial sort of bewilderment around it. The term was just vague. People were throwing around phrases like masters of the universe for data scientists.
Right, which must have annoyed some statisticians.
You can imagine. They felt like, hey, that's our field, the science of data. Identity theft, almost. But the core argument in the book, and I think it holds up, is that data science isn't just rebranding.
Not just a new buzzword.
Exactly. It's genuinely a new idea, maybe still a bit fragile or evolving, but it uniquely combines foundations from statistics and computer science. Plus it has this distinct process tied to it.
And part of that newness, I think, came from this idea of datafication. Kenneth Cukier and Viktor Mayer-Schönberger talked about this in Foreign Affairs, maybe mid-2013.
"The Rise of Big Data." Yeah.
They defined datafication as basically taking all aspects of life and turning them into data.
Which sounds huge.
It is. Think about it: Google Glass datafying your gaze, Twitter turning stray thoughts into data points, LinkedIn mapping out your professional life. Everything becomes potentially quantifiable.
Which immediately makes you ask: okay, who is this "we" doing the datafying, and what kind of value are they actually creating? Often it's, well, modelers, entrepreneurs.
Looking for efficiency, automation.
Pretty much, yeah. And what's really striking is that so much of this wasn't bubbling up in academia initially. It was happening in industry, in tech companies. That's quite different from how statistics traditionally developed.
So if it's this broad new thing happening in industry, what does a data scientist actually look like? Schutt had this interesting exercise for her students.
The self-profiling, yeah.
Right. Rate yourself on computer science, math, stats, machine learning, domain expertise, communication, and data visualization.
And the results were all over the place. It showed pretty clearly that, you know, no single person is going to be brilliant at all of those things.
Yeah, unicorns are rare.
Exactly, which led to this idea: maybe it's more useful to define a data science team than one perfect data scientist.
Makes sense. Like that Josh Wills quote.
Oh yeah, the classic: a data scientist is a person who is better at statistics than any software engineer, and better at software engineering than any statistician.
That captures it pretty well.
It does. Fundamentally, a data scientist extracts meaning from data and interprets it. They need tools from stats and machine learning, sure, but also, crucially, human intuition. And let's be honest, a huge part of the job is just collecting, cleaning, and munging data. Wrestling with it, because real-world data is just inherently messy. Always.
Okay, so moving from the who to the how: how does this actually get done? We hear "big data" all the time, but it's kind of a vague term.
It really is.
Yeah.
The book breaks it down nicely, though, into three parts. One, it's a set of technologies. Two, it's potentially a revolution in how we measure things. And three, it's a point of view, really a philosophy, about how decisions are going to be made in the future based on data.
Right, and connecting big data back to basic stats, like populations and samples, seems important.
Oh, absolutely critical. There's this dangerous assumption sometimes with big data that N equals all. You know, that you have all the data.
But you never really do.
You pretty much never do. There's always something missing, some context you don't have. Kate Crawford's talk on the Hurricane Sandy tweets is such a powerful example of this.
What was the gist of that?
Well, looking at the tweets, you might think New Yorkers were just casually shopping before the storm and partying after. But that's because they were the ones tweeting heavily.
Ah, so it missed the people really affected.
Exactly, coastal New Jerseyans whose homes were being destroyed. They weren't tweeting about their grocery runs. It just shows how subjective the whole process is. You, the data scientist, are turning the world into data. It's not objective.
Data doesn't just speak for itself.
Never. Be very skeptical if someone claims it does.
Okay, so data is subjective. We need context. Then we get to modeling. This sounds like where the magic happens.
Or the hard work. Maybe both. And when we say model here, we don't mean, like, a database schema. We mean a statistical model.
Like a mathematical function.
Yeah, one that tries to capture the uncertainty, the randomness, in how the data was generated. And building these, it's definitely part art, part science. Textbooks don't really give you a step-by-step guide. You have to make assumptions, a lot of assumptions, about reality. But yeah, we'll get into how that works.
And you mentioned a big pitfall here: overfitting.
Yes, get ready to hear about overfitting a lot, possibly until you have nightmares about it.
Okay, okay. So what is it?
It's when your model gets too good at explaining the specific data you trained it on, including all the random noise and quirks in that sample.
So it learns the noise, not the signal.
Precisely, and then it fails, often badly, when you try to use it on new, unseen data. It hasn't learned the general pattern, just the specifics of the test it studied for, so to speak.
Right, it can't generalize. So before we even get to complex models, what's the first step?
Exploratory data analysis, EDA. Yeah, it's absolutely fundamental.
And that's more than just plotting things?
Oh, much more. It's a mindset. It's about getting intuition, understanding the shape of your data, feeling how it connects back to the real-world process that created it.
So what does it help you do, practically?
Well, gain intuition, obviously. Make comparisons, do basic sanity checks: is the data on the right scale, is it in the right format? Spot missing values or crazy outliers, summarize things, even debug how the data was logged in the first place.
Okay, like the example with the New York Times ad data, nyt1.csv through nyt31.csv.
Exactly. The students had to plot distributions of ad impressions and click-through rates, the CTR, for different age groups, and segment users by whether they clicked or not, using R in that case. It forces you to really look at the data first.
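To make the click-through-rate exercise concrete, here is a minimal sketch in Python rather than R. The rows, the field names (`age`, `impressions`, `clicks`), and the age brackets are all made up for illustration; they are assumptions, not the schema or contents of the actual NYT files.

```python
from collections import defaultdict

# Hypothetical rows in the spirit of the exercise. The field names and the
# values are assumptions for illustration, not the real files' contents.
rows = [
    {"age": 17, "impressions": 5, "clicks": 1},
    {"age": 23, "impressions": 4, "clicks": 0},
    {"age": 31, "impressions": 6, "clicks": 0},
    {"age": 45, "impressions": 3, "clicks": 0},
    {"age": 52, "impressions": 5, "clicks": 1},
    {"age": 70, "impressions": 2, "clicks": 1},
]

def age_group(age):
    """Bucket an age into a coarse, hand-picked bracket."""
    if age < 18:
        return "<18"
    if age < 35:
        return "18-34"
    if age < 55:
        return "35-54"
    return "55+"

def ctr_by_group(rows):
    """Return {age bracket: total clicks / total impressions}."""
    impressions = defaultdict(int)
    clicks = defaultdict(int)
    for row in rows:
        group = age_group(row["age"])
        impressions[group] += row["impressions"]
        clicks[group] += row["clicks"]
    # Skip brackets with no impressions to avoid dividing by zero.
    return {g: clicks[g] / impressions[g] for g in impressions if impressions[g] > 0}

ctr = ctr_by_group(rows)
for group, rate in sorted(ctr.items()):
    print(f"{group}: CTR = {rate:.3f}")
```

Even a summary this small is a sanity check: a bracket with an implausibly high rate usually means very few impressions, which is worth knowing before modeling anything.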
And this whole process, it kind of mirrors the scientific method, doesn't it?
It really does. You ask a question, you research, explore the data, form a hypothesis, test it, build a model, analyze the results, communicate them.
But with a twist.
Yeah. The big difference is the feedback loop. When you build a data product, like a spam filter or a recommendation engine, it goes out into the world, people use it, and their interactions generate more data.
Which feeds back into the system.
Right, it's a dynamic cycle. It's not like predicting the weather, where your forecast doesn't actually change tomorrow's weather. Here the model influences the world, which generates new data for the model.
And the data scientist is involved all the way through.
Absolutely, from deciding what data to even collect, to asking the first questions, planning the attack, and yeah, writing the code.
The RealDirect case study sounds like a good example of this, using data in real estate.
Yeah, Doug Perlson's company. The traditional real estate broker system was, well, broken in terms of data. Brokers guarded their info fiercely. Public data was months out of date.
So what did RealDirect do?
They hired agents who pooled their knowledge, used data-driven tips, built real-time recommendations, and tried to get live feeds on searches, offers, closing times, all that stuff.
And the business model reflected that efficiency.
Right, a subscription model plus a lower commission, because the data supposedly made things more efficient. The exercise for the students was literally: okay, you're advising the CEO, define a data strategy. What data do you need? Where do you get it? How do you clean it, explore it, summarize it? It puts it all together.
Okay, let's shift gears to the algorithms, the engines driving this. Machine learning versus statistical modeling: always confusing.
It is confusing, because there's so much overlap. ML algorithms, mostly from computer science, do prediction, classification, clustering. Statistical modeling, from stats departments, does, well, prediction, classification, clustering.
So what's the real difference, then?
Often it's about the goal and the origin. Many ML algorithms, especially the ones driving AI, image recognition, speech, recommenders, weren't typically part of a core stats curriculum. And crucially, they're often not designed to help you infer the underlying why.
They just want the best prediction.
Exactly. Maximum accuracy is usually the goal, whereas statistical modeling often puts more emphasis on understanding the relationships and the uncertainty. But honestly, good data scientists use both. They know when each approach is more valuable.
Right, and the warning you mentioned: don't be a hammer looking for a nail.
Precisely. Don't just grab the algorithm you know best and force it onto the problem. First understand the problem context, figure out its mathematical structure, then see which algorithms fit.
Makes sense. Let's start with a classic: linear regression.
Ah, yes, your bread and butter for predicting a continuous outcome, like price or temperature, using one or more predictors.
We usually start by thinking about simple lines, like y = 25x. Deterministic.
Right. But the key mental shift is moving to stochastic functions, acknowledging that there's randomness, uncertainty. The line represents the average trend, but the points will scatter around it.
And how do you find the best line?
You minimize the distance between the points and the line, specifically the sum of the squared vertical distances. That's least squares; average those squared distances and you get the mean squared error.
And you evaluate it with things like p-values.
Yeah, p-values help you test if your predictors actually have a statistically significant effect. Are their coefficients likely different from zero? You can add more predictors. That's multiple linear regression, which then raises the question of feature selection: which predictors matter most?
And simulating data can help you understand this?
Oh, hugely useful, especially when learning. You create fake data where you know the true relationship, then you see if your model can recover it. How does sample size affect things? What happens if you add irrelevant variables? It builds intuition.
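The simulate-then-recover idea can be sketched in a few lines of plain Python. The "true" intercept, slope, and noise level below are assumptions picked for the demo, and `fit_line` is a hypothetical helper implementing one-predictor least squares, not code from the book.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def mean_squared_error(xs, ys, intercept, slope):
    """Average squared vertical distance between the points and the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

random.seed(0)
# Assumed "truth" for the simulation: y = 2 + 25x plus Gaussian noise.
TRUE_INTERCEPT, TRUE_SLOPE, NOISE_SD = 2.0, 25.0, 1.0

xs = [random.uniform(0, 10) for _ in range(1000)]
ys = [TRUE_INTERCEPT + TRUE_SLOPE * x + random.gauss(0, NOISE_SD) for x in xs]

intercept, slope = fit_line(xs, ys)
mse = mean_squared_error(xs, ys, intercept, slope)
print(f"intercept={intercept:.2f} slope={slope:.2f} mse={mse:.2f}")
```

Shrinking the sample from 1000 points to 10, or raising `NOISE_SD`, is a quick way to see how far the recovered slope can drift from the one that generated the data.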
Okay, what about classifying things? Finding similar items?
That sounds like k-nearest neighbors, or k-NN.
Right, k-NN. How does that work?
The idea is simple. To classify a new, unlabeled item, you look at its k closest neighbors in a data set where you do have labels. Then you assign the class most common among those neighbors. Examples could be anything: classifying emails as spam or not spam based on similar emails, assessing credit risk based on similar applicants, recommending restaurants based on what similar users like. Find the neighbors. And the key choices come down to two main things. First, how do you define closest?
You need a distance metric. Euclidean is common for points, cosine for text, Hamming for strings, Manhattan for grid-like paths. It depends on the data. Second, choosing k: how many neighbors do you consult? One, five, twenty? That's a tuning parameter.
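As a rough sketch of the two choices just described, a distance metric and a value of k, here is a tiny nearest-neighbor classifier. The training points and the "low risk" / "high risk" labels are invented for illustration, and `knn_classify` is a hypothetical helper name.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(query, labeled_points, k=3, distance=euclidean):
    """Majority vote among the k labeled points closest to `query`.

    labeled_points is a list of (features, label) pairs.
    """
    neighbors = sorted(labeled_points, key=lambda p: distance(query, p[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Made-up labeled data: two well-separated groups.
training = [
    ((1.0, 1.0), "low risk"), ((1.5, 2.0), "low risk"), ((2.0, 1.0), "low risk"),
    ((8.0, 8.0), "high risk"), ((8.5, 9.0), "high risk"), ((9.0, 8.0), "high risk"),
]

print(knn_classify((2.0, 2.0), training, k=3))
print(knn_classify((7.5, 8.5), training, k=3))
```

Passing a different function as `distance` (Manhattan, cosine, and so on) is the only change needed to try another metric, which is why the metric is a parameter here.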
And this is where it gets interesting: the curse of dimensionality.
Ah, yes, the curse. k-NN works great in low dimensions, like recognizing handwritten digits, where pixels in a 256-dimensional space have a natural closeness. But imagine text data with thousands of dimensions, one per word.
Things get spread out.
Exactly. In high dimensions, everything is kind of far away from everything else. Your nearest neighbors might not be very similar at all in any meaningful sense. k-NN breaks down. That's why it's usually bad for spam filtering.
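One way to see "everything is far from everything else" is a small simulation, sketched below under the assumption of uniformly random points in a unit cube (real text data is not uniform, so treat this only as intuition). It compares the nearest and farthest distance from a query point as the number of dimensions grows.

```python
import math
import random

def distance_contrast(n_points, n_dims, rng):
    """Ratio of nearest to farthest distance from one random query point.

    Near 0 means some points are clearly closer than others.
    Near 1 means every point is about equally far away.
    """
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.sqrt(sum((q - p) ** 2 for q, p in zip(query, point))))
    return min(dists) / max(dists)

rng = random.Random(0)
contrast = {d: distance_contrast(200, d, rng) for d in (2, 10, 100, 1000)}
for dims, ratio in contrast.items():
    print(f"{dims:>4} dimensions: nearest/farthest = {ratio:.2f}")
```

As the ratio climbs toward 1, the "nearest" neighbor is barely nearer than the farthest one, so a vote among neighbors carries little information.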
Good point. Other k-NN pitfalls?
You definitely need to scale your variables. If income is in dollars and age is in years, income will dominate the distance calculation unless you scale them. And overfitting is a risk, especially if k is one. Then you're just copying the label of the single closest point, which might be noise. Correlated features can also distort distances.
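The income-versus-age point is easy to demonstrate. In this sketch the four applicants and the query are invented numbers, and min-max scaling is just one reasonable choice among several (standardizing to z-scores is another).

```python
import math

def nearest(query, points):
    """Label of the point closest to `query` by Euclidean distance."""
    return min(points, key=lambda label: math.dist(query, points[label]))

def min_max_scale(value, lo, hi):
    """Map a value from [lo, hi] onto [0, 1]."""
    return (value - lo) / (hi - lo)

# Hypothetical applicants as (income in dollars, age in years).
applicants = {"A": (50_500, 60), "B": (52_000, 26), "C": (30_000, 40), "D": (90_000, 30)}
query = (50_000, 25)

# Raw units: a few hundred dollars outweigh a 35-year age gap.
nearest_raw = nearest(query, applicants)

# Rescale each column using the range seen in the labeled data.
incomes = [p[0] for p in applicants.values()]
ages = [p[1] for p in applicants.values()]

def scale(point):
    return (min_max_scale(point[0], min(incomes), max(incomes)),
            min_max_scale(point[1], min(ages), max(ages)))

scaled = {label: scale(p) for label, p in applicants.items()}
nearest_scaled = nearest(scale(query), scaled)

print("nearest on raw units:   ", nearest_raw)
print("nearest after rescaling:", nearest_scaled)
```

On raw units the 60-year-old with a similar income wins; after rescaling, the 26-year-old does, which is arguably the more sensible neighbor for a 25-year-old.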
Okay, so k-NN needs labels. What if you don't have labels, but you suspect there are groups in your data?
Then you're talking about unsupervised learning, and k-means clustering is a common technique there.
Unsupervised. So the algorithm finds the groups itself?
Precisely. You tell it how many clusters, k, you think exist, and the algorithm iteratively assigns points to the nearest cluster center, the centroid, and then recalculates the centroids until things stabilize.
Why would you do that?
Lots of reasons. Maybe you want to segment users for different marketing or product experiences, or build separate predictive models for distinct customer groups. k-means helps you discover those groups automatically, instead of you trying to define them with arbitrary rules or thresholds.
So it automates finding clusters in, like, many dimensions.
Yeah, that's the power, but it has its quirks. Choosing the right k is often more art than science, and sometimes the algorithm can get stuck in a suboptimal solution, depending on where it starts.
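The assign-then-recompute loop can be sketched as follows. This is a bare-bones version of Lloyd's algorithm on toy points; taking the starting centroids as an explicit argument is a choice made here to keep the demo deterministic and to make the sensitivity to starting position visible.

```python
import math

def kmeans(points, initial_centroids, max_iter=100):
    """Alternate assignment and centroid update until the centroids stop moving.

    Returns (centroids, assignments), where assignments[i] is the index of
    the centroid that points[i] ended up with.
    """
    centroids = [tuple(c) for c in initial_centroids]
    assignments = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its points.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(old)  # an empty cluster keeps its old centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignments

# Toy data: two obvious groups.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centroids, assignments = kmeans(points, initial_centroids=[points[0], points[-1]])
print(centroids)
print(assignments)
```

With well-separated groups almost any start works. On messier data, different `initial_centroids` can converge to different answers, which is the "stuck in a suboptimal solution" quirk mentioned above.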
Is it an old algorithm?
The basic idea goes back to the fifties, with Steinhaus and Lloyd, and the term k-means dates from 1967. There are newer versions, like k-means++ from 2007, that try to start the algorithm off better.
Okay, so we said k-NN isn't great for spam filtering because of high dimensions. What does work well?
That brings us to naive Bayes, a surprisingly effective probabilistic approach.
Based on Bayes' law.
Exactly. Remember Bayes' law from stats? P(A|B) = P(B|A) P(A) / P(B).
Vaguely. Like the disease testing example: the probability you're sick given a positive test.
That's the one. We apply the same logic to spam. What's the probability an email is spam, given that it contains the word "viagra"? P(spam|word) = P(word|spam) P(spam) / P(word).
Makes sense. What's the naive part?
The naive assumption is that the words in the email appear independently of each other.
Which isn't true, right? "Free" and "viagra" probably appear together more often than by chance in spam.
Totally untrue. But the simplification makes the math tractable, and surprisingly it often works really well in practice, especially for text.
Any pitfalls with just counting words?
Oh yeah. If a word like "viagra" only appeared in spam in your training data, the model might assign a one hundred percent probability of spam if it sees that word again. That's overfitting. Also, what if you see a word you've never seen before? Its probability would be zero, which messes up the calculation.
So how do you fix that?
With Laplace smoothing, sometimes called additive smoothing. You basically add a small pseudocount to every word count, pretending you've seen each word at least once, or a fraction of a time. It prevents zero probabilities and generally makes the estimates more robust.
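Here is a small worked sketch of Bayes' law for a single word, with and without smoothing. The counts are hypothetical, the prior of 0.5 is an assumption, and the `2 * alpha` in the denominator reflects the Bernoulli setup where a word is either present or absent.

```python
def word_given_class(docs_with_word, docs_in_class, alpha=1.0):
    """Estimate P(word | class) from document counts.

    alpha > 0 applies Laplace (additive) smoothing. The 2 * alpha covers
    the two possible outcomes: word present or word absent.
    """
    return (docs_with_word + alpha) / (docs_in_class + 2 * alpha)

def spam_posterior(p_word_spam, p_word_ham, p_spam=0.5):
    """Bayes' law: P(spam | word) from the two likelihoods and the prior."""
    numerator = p_word_spam * p_spam
    evidence = numerator + p_word_ham * (1 - p_spam)
    return numerator / evidence

# Hypothetical counts: the word appears in 3 of 10 spam emails
# and in 0 of 10 legitimate ("ham") emails.
unsmoothed = spam_posterior(word_given_class(3, 10, alpha=0),
                            word_given_class(0, 10, alpha=0))
smoothed = spam_posterior(word_given_class(3, 10),
                          word_given_class(0, 10))

print(f"without smoothing: P(spam | word) = {unsmoothed:.2f}")
print(f"with smoothing:    P(spam | word) = {smoothed:.2f}")
```

Without smoothing the model is certain after seeing just three examples; with one pseudocount it still leans strongly toward spam but leaves room for doubt.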
And this was used in that NYT article classification exercise.
Yes, exactly, Jake's exercise. Download two thousand articles from different sections, arts, business, sports, et cetera, using the API. Train a naive Bayes model, specifically Bernoulli naive Bayes here, to classify them, tune the smoothing parameters, evaluate with a confusion matrix, and see which words were most indicative of each section. Great hands-on example.
Cool. Another big one: logistic regression. How's that different from linear regression?
Linear regression predicts a continuous value, right, like a house price. Logistic regression predicts the probability of a binary outcome, something that's either yes or no, zero or one.
Like, will a user click an ad? Is this email spam? Will this customer churn?
Exactly, those kinds of things. Binary outcomes.
And how does it predict a probability? Doesn't a linear model output any number?
It starts with a linear combination of features, just like linear regression: alpha plus beta times x. But then it feeds that result through a special function called the logistic function, or sigmoid function.
The S-shaped curve.
That's the one. P(t) = 1 / (1 + e^(-t)). This function squishes any input value into an output between zero and one, perfect for representing a probability.
So alpha and beta still mean something?
Yep. Alpha is related to the baseline probability, the overall odds, and the betas are the weights for each feature, telling you how much each feature changes the log odds of the outcome.
How do you find the best alpha and betas?
Usually with maximum likelihood estimation, you find the parameters that make the observed data most probable. This often involves optimization algorithms like Newton's method, or, especially for huge data sets, stochastic gradient descent SGD.
SGD sounds familiar.
Very common in large-scale machine learning. It updates the parameters using just one data point or a small batch at a time, making it efficient for massive data sets, especially sparse ones. Tools like Mahout or Vowpal Wabbit use it heavily.
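The pieces discussed above, a linear score, the logistic function, and one-point-at-a-time updates, fit together roughly like this. It is a toy, single-feature sketch; the learning rate, epoch count, and data are assumptions, and production tools add shuffling, regularization, and sparse-data handling that are omitted here.

```python
import math

def sigmoid(t):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

def fit_logistic_sgd(xs, ys, learning_rate=0.1, epochs=100):
    """One-feature logistic regression fit by stochastic gradient ascent
    on the log-likelihood, updating after every single data point."""
    alpha, beta = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            # Residual: observed outcome minus predicted probability.
            error = y - sigmoid(alpha + beta * x)
            alpha += learning_rate * error
            beta += learning_rate * error * x
    return alpha, beta

# Toy data: negative x values are class 0, positive x values are class 1.
xs = [-3, -2, -1, 1, 2, 3]
ys = [0, 0, 0, 1, 1, 1]

alpha, beta = fit_logistic_sgd(xs, ys)
print(f"alpha={alpha:.2f} beta={beta:.2f}")
print(f"P(y=1 | x=3)  = {sigmoid(alpha + beta * 3):.3f}")
print(f"P(y=1 | x=-3) = {sigmoid(alpha + beta * -3):.3f}")
```

Because each update touches only one data point, the same loop works when the data arrives as a stream or is too large to hold in memory at once.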
Now, evaluating these models. You said accuracy isn't always great.
Right, especially with imbalanced classes. If only one percent of emails are spam, a model predicting "not spam" one hundred percent of the time is ninety-nine percent accurate, but useless.
So what should we use instead?
Look at precision: of the times you predicted spam, how often were you right? And recall: of all the actual spam, how much did you catch? Often there's a trade-off.
And F-score? AUC?
F-score tries to combine precision and recall into one number. AUC, the area under the ROC curve, is really good because it measures performance across all possible thresholds and isn't thrown off by imbalanced classes. It's base-rate invariant.
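The one-percent-spam example can be checked directly. This sketch computes accuracy, precision, recall, and F1 for two hypothetical models on 1,000 invented emails; returning 0.0 when a ratio is undefined is a convention chosen here, not a universal rule.

```python
def classification_metrics(actual, predicted):
    """Accuracy, precision, recall and F1 for binary labels (1 = spam)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # Convention for this sketch: undefined ratios are reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# 1,000 hypothetical emails, 10 of them spam.
actual = [1] * 10 + [0] * 990

# Model 1 never says "spam".
always_ham = classification_metrics(actual, [0] * 1000)

# Model 2 catches 8 of the 10 spam emails but raises 12 false alarms.
flagging = classification_metrics(actual, [1] * 8 + [0] * 2 + [1] * 12 + [0] * 978)

print("always 'not spam':", always_ham)
print("flagging model:   ", flagging)
```

The do-nothing model wins on accuracy, 0.99 against 0.986, while catching no spam at all, which is exactly why accuracy alone misleads on imbalanced classes.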
But even these metrics might not capture the real goal.
Exactly. Your model might have great AUC, but does it actually increase revenue or user engagement? That's where A/B testing comes in, the gold standard for real-world impact.
Yeah, run a controlled experiment: show the old system to group A, the new model to group B, and measure the actual business outcome you care about. Google's paper on experimentation really drives this home.
And Media6Degrees, M6D, used logistic regression for predicting ad clicks.
Yeah, a classic application: user-level conversion prediction, highly scalable and effective for binary outcomes.
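For the A/B comparison described a few turns back, one common way to judge whether group B really did better is a two-proportion z-test. The sketch below uses the pooled normal approximation and invented conversion counts; it is one standard choice, not necessarily what any particular company runs.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates.

    Uses the pooled-variance normal approximation.
    Returns (z statistic, p-value).
    """
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return z, 2 * (1 - cdf)

# Hypothetical experiment: 100 of 1,000 users convert on the old system (A),
# 130 of 1,000 on the new model (B).
z, p_value = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=130, n_b=1000)
print(f"z = {z:.2f}, p-value = {p_value:.3f}")
```

The approximation is reasonable when both groups are large and randomly assigned. With small samples or repeated peeking at results, the p-value is less trustworthy.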
So, okay, let's get into some real-world messiness. Timestamps seem simple, but you said they're tricky.
Oh, they are. You get tons of timestamped event data: user clicks, check-ins, sensor readings. That's big data right there. But they introduce subtle problems.
Like what?
The biggest is causality. You cannot use information from the future to predict the present or past. Sounds obvious, but it's easy to accidentally leak future information into your training data if you're not careful with timestamps.
Ah, the time travel problem.
Exactly. You also have to be super careful distinguishing in-sample training data from out-of-sample testing data based on time, and often you need running estimates, like a running average, not a single average calculated over all past data, because the world changes.
Like in finance?
Finance is a great example. They often use log returns instead of simple percentage returns, because log returns handle compounding better and are more symmetric. Volatility is key, and techniques like exponential downweighting let you calculate a volatility estimate that gives more weight to recent data, efficiently updating with just the latest info.
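Both ideas, log returns and an exponentially downweighted running estimate, fit in a short sketch. The prices are invented, and the decay factor of 0.94 is an assumption (it is a commonly quoted value for daily data, but any value between 0 and 1 works the same way).

```python
import math

def log_returns(prices):
    """Log return between each pair of consecutive prices."""
    return [math.log(later / earlier) for earlier, later in zip(prices, prices[1:])]

def running_volatility(returns, decay=0.94):
    """Exponentially downweighted volatility estimate.

    Each update needs only the previous variance estimate and the newest
    return, so older observations fade out without being stored.
    """
    variance = returns[0] ** 2  # seed the estimate with the first squared return
    estimates = []
    for r in returns:
        variance = decay * variance + (1 - decay) * r ** 2
        estimates.append(math.sqrt(variance))
    return estimates

# Hypothetical prices that go up 10 percent, then back down, repeatedly.
prices = [100.0, 110.0, 100.0, 110.0, 100.0]
returns = log_returns(prices)
vols = running_volatility(returns)

print([round(r, 4) for r in returns])
print([round(v, 4) for v in vols])
```

Note the symmetry: going from 100 to 110 and back gives log returns that cancel exactly, whereas the simple percentage returns (+10 percent, then about -9.1 percent) do not.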
And financial markets have weak signals.
Extremely weak. These days, so many algorithms are looking for the same patterns that the signals get arbitraged away quickly. You might aim for just a tiny positive correlation, like three percent over a day. Yet linear regression is still used, because it's robust even with all that noise.
And they use things like priors and regularization?
Yes, to keep the model stable. Priors incorporate existing beliefs, and regularization adds penalties to stop coefficients from getting too large or varying too wildly. It helps prevent overfitting in noisy, high-dimensional data. Mathematically, it often simplifies to adding a term to the covariance matrix.
Which brings us to figuring out what data to even put into the model. Feature engineering, the art of data science, some say.
It's about deciding which variables, features, to use and how to transform them to be most effective for your model. Schutt's four-quadrants idea highlights thinking about relevance, usefulness, and whether something's even logged. Your creativity is the limit.
How do you select the best features?
Several ways. Filters rank features individually by predictive power, like correlation, but they miss interactions. Stepwise regression adds or removes features one by one from a model, watching metrics like R-squared or p-values, but careful, it can easily overfit.
What about embedded methods?
These methods build feature selection right into the model training. Decision trees are a prime example.
How do they work?
They recursively split the data based on the feature that provides the most information gain at each step, essentially creating a flow chart to classify things. They're very interpretable. You can see the rules.
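Information gain is just the drop in entropy after a split. In this sketch the eight passenger rows are invented to make the contrast obvious (they are not the real Titanic data), and `information_gain` is a hypothetical helper for a single categorical feature.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, feature, target):
    """Entropy of the target before a split on `feature`, minus the
    size-weighted entropy of the groups after the split."""
    parent = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[target])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return parent - weighted

# Invented toy rows, deliberately extreme: sex predicts the outcome
# perfectly here and port of embarkation not at all.
passengers = [
    {"sex": "female", "port": "S", "survived": 1},
    {"sex": "female", "port": "S", "survived": 1},
    {"sex": "female", "port": "C", "survived": 1},
    {"sex": "female", "port": "C", "survived": 1},
    {"sex": "male", "port": "S", "survived": 0},
    {"sex": "male", "port": "S", "survived": 0},
    {"sex": "male", "port": "C", "survived": 0},
    {"sex": "male", "port": "C", "survived": 0},
]

gain_sex = information_gain(passengers, "sex", "survived")
gain_port = information_gain(passengers, "port", "survived")
print(f"gain from splitting on sex:  {gain_sex:.2f} bits")
print(f"gain from splitting on port: {gain_port:.2f} bits")
```

A tree builder would pick the feature with the larger gain as the first split, then repeat the same calculation inside each resulting group.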
Like the Titanic survival example?
Classic. Decision trees can figure out rules like: if female, survived; if male and child, survived. They handle continuous variables by finding optimal split points, but you often need to prune the tree, cutting back branches to stop it from overfitting the training data.
And random forests?
They take decision trees to the next level. They build many decision trees on different random subsets of the data and features, that's bagging, and then average their predictions. Much more accurate and robust than a single tree, but you lose that easy interpretability.
Now, when selecting features, that causation versus correlation issue seems critical.
Huge. Is "user played ten times" a feature describing the user's behavior, or is "you showed ten ads" a feature describing your action? If you want actionable insights to change outcomes, you need to focus on features you can actually control.
David Huffaker at Google talked about mixing methods.
Yeah, a hybrid approach, combining qualitative insights from small user interviews with quantitative analysis of large-scale log data. The Google+ circles feature apparently came from that kind of mix: interviews sparking ideas, tested with big data, moving from description to prediction.
Which also raises privacy concerns.
Absolutely, anytime you're dealing with human data. What are people worried about? Identity theft, financial loss, creepy ads. The book mentions ideas like clearer data flow diagrams, privacy controls, sensible defaults. Big, thorny issues.
Let's talk visualization again. Mark Hansen framed it as more than just nice plots, right?
As an art and information discipline that actually changes how we see things. Change the instruments, and you change the theory. His examples, Million Dollar Blocks, Project Cascade, the NYT lobby display, showed data as a powerful communication and exploration tool, sometimes even as art.
And Ian Wong at Square connected this directly to fraud detection.
Exactly. Square deals with massive potential fraud. They use machine learning heavily, but Wong stressed the importance of visualization alongside it. It helps them understand why the model flags something, especially with high class imbalance.
Where fraud is rare, so accuracy is useless.
Precisely, they focus on precision and recall. Wong's tips were great too. Models aren't black boxes. Iterate quickly, like experiments. Keep your code clean and reusable.
And productionizing the models?
Making them work in real time, minimizing the gap between offline tests and online reality. Visualization isn't just for building the model. It helps the human operations team review transactions efficiently. It augments their intelligence like an exoskeleton.
Okay, this is a really important distinction. Prediction versus causality.
Yes, this is where things get philosophically deep, maybe. Recommending a book because you read a similar one, that's prediction. Understanding what causes someone to buy a book, or get sick, or click an ad, that's causality.
The methods might look similar sometimes.
They might use similar stats, but the intent is different, and that changes everything about how you design your analysis and interpret the results.
And observational data is tricky for causality.
Very. Ori Stitelman mentioned that OkCupid example: maybe the word "beautiful" in emails correlated with responses, but did it cause them? Or were people who wrote "beautiful" also doing other things differently? Confounders everywhere.
So the gold standard is?
Randomized controlled trials, RCTs. You randomly assign people to treatment or control. If the groups start out statistically identical thanks to randomization, any difference in outcome can be attributed to the treatment. That's causal inference.
And A/B tests are like RCTs for tech.
Pretty much. Much easier logistically than clinical trials, usually with less at stake and more control if you have the infrastructure. Randomly show version A or version B of a web page, then measure the difference in clicks or conversions.
But watching out for things like Simpson's paradox?
Oh yeah, a huge pitfall in observational data. A trend appears in the overall data, but reverses when you break it down into subgroups, so you can't just trust aggregated numbers.
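A reversal like this is easiest to believe with numbers in front of you. The counts below are illustrative, arranged so that treatment A is mostly given to severe cases; the group names and figures are assumptions for the demo.

```python
# Illustrative counts as (successes, patients) for two treatments,
# split by how severe the case was.
outcomes = {
    "A": {"mild": (81, 87), "severe": (192, 263)},
    "B": {"mild": (234, 270), "severe": (55, 80)},
}

def rate(successes, total):
    return successes / total

def overall_rate(groups):
    """Success rate after lumping all severity groups together."""
    successes = sum(s for s, _ in groups.values())
    total = sum(n for _, n in groups.values())
    return successes / total

for severity in ("mild", "severe"):
    a = rate(*outcomes["A"][severity])
    b = rate(*outcomes["B"][severity])
    print(f"{severity:>6}: A = {a:.3f}, B = {b:.3f}")

overall_a = overall_rate(outcomes["A"])
overall_b = overall_rate(outcomes["B"])
print(f"overall: A = {overall_a:.3f}, B = {overall_b:.3f}")
```

A does better within each severity group, yet B looks better overall, because A was handed most of the hard cases. Severity here plays the role of the confounder.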
David Madigan's work sounds intense. Using the Rubin causal model?
Right, a formal framework to think about what you can and can't know from observational studies. His OMOP project analyzed huge medical databases.
Two hundred million people.
Yeah, and the shocking finding was that for many medical questions using the same database but different standard analysis methods, you could get opposite conclusions, like whether a drug caused cancer or not.
Wow. That really underlines the challenge of inferring cause from observed data.
It really does. Choices in analysis matter profoundly.
Another scary problem: data leakage. Claudia Perlich called it a huge problem.
It is, both in competitions and the real world. It's when your model accidentally learns from information it wouldn't have in a real prediction scenario. It mistakes the leak for signal.
How does that happen?
Could be implicit future information, like using diagnosis codes assigned after a prediction should have been made, or non-random sampling. Perlich gave examples like patient IDs accidentally encoding clinic location, making cancer prediction trivial, or pneumonia models exploiting diagnosis codes.
So how do you avoid it? Specific advice?
Yes, very specific. Strict temporal cutoff: remove everything learned even microseconds before the event you're predicting. Timestamp everything based on when it was known, not just when it happened. Start clean with raw data, and most importantly, understand how the data was generated.
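The "when it was known, not when it happened" rule can be sketched as a simple filter. The records, the field names `happened_at` and `known_at`, and the dates are all hypothetical; the point is only the difference between the two filters.

```python
from datetime import datetime

# Hypothetical patient records. `happened_at` is when the event occurred;
# `known_at` is when the value actually became available to a model.
records = [
    {"feature": "age", "happened_at": datetime(2020, 1, 1), "known_at": datetime(2020, 1, 1)},
    {"feature": "lab_result", "happened_at": datetime(2020, 3, 1), "known_at": datetime(2020, 3, 5)},
    {"feature": "diagnosis_code", "happened_at": datetime(2020, 3, 4), "known_at": datetime(2020, 3, 4)},
]

prediction_time = datetime(2020, 3, 3)

def usable_features(records, prediction_time):
    """Keep only features that were already known before prediction time."""
    return [r["feature"] for r in records if r["known_at"] < prediction_time]

# Filtering on when something happened lets the lab result leak in:
# the sample was taken before the cutoff, but the result arrived after it.
leaky = [r["feature"] for r in records if r["happened_at"] < prediction_time]
safe = usable_features(records, prediction_time)

print("filtered on happened_at:", leaky)
print("filtered on known_at:   ", safe)
```

In a real pipeline the hard part is usually recovering an honest `known_at` for every field, which is why understanding how the data was generated matters so much.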
Also important is model calibration. Are the probabilities right?
Crucial. A model might be good at ranking risks, but are its predicted probabilities accurate? Like, when it says seventy percent chance, does it happen seventy percent of the time? Unpruned decision trees, she showed, often give probabilities that are too extreme, while logistic regression tends to be better calibrated.
Okay, last big area: data engineering, scale and complexity. MapReduce and Hadoop. Do data scientists need to know this?
Often not directly. Few people are coding MapReduce jobs anymore, with newer tools abstracting it, but understanding the concepts is still really valuable.
Why was MapReduce developed? What problem does it solve?
Think about counting word frequencies in terabytes of text. You can't load it all into memory on one machine, and even if you could split it, combining the counts back together creates a bottleneck, the fan-in problem. MapReduce provides a framework for distributing the work across many machines, handling failures automatically, and managing the complexity.
But it can be unintuitive.
Converting some algorithms into map (process chunks) and reduce (aggregate results) steps isn't always obvious, and how evenly the data is spread across the machines is critical for performance.
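The word-count example maps onto the two steps like this. It is a single-process imitation of the pattern, not Hadoop code: the function names `map_phase`, `shuffle`, and `reduce_phase` are made up here, and a real framework would run the mappers and reducers on different machines.

```python
from collections import defaultdict

def map_phase(chunk):
    """Mapper: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(mapped_pairs):
    """Group all emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for word, count in mapped_pairs:
        grouped[word].append(count)
    return grouped

def reduce_phase(word, counts):
    """Reducer: collapse all the counts for one word into a total."""
    return word, sum(counts)

# Each string stands in for a chunk that would live on a different machine.
chunks = ["the data tells the story", "the aggregate tells the story"]

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = dict(reduce_phase(word, counts) for word, counts in shuffle(mapped).items())
print(word_counts)
```

The skew issue shows up even in this toy: a very common word like "the" sends far more pairs to its reducer than a rare word does, so one reducer can end up with most of the work.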
And Josh Wills's big data economic law. Love that one.
No single record is that valuable, but every record, the whole collection, is incredibly valuable. The aggregate tells the story.
And tools like Pregel and Mahout build on this.
Yeah, Pregel's for large-scale graph processing, Mahout for machine learning algorithms implemented on top of Hadoop MapReduce, handling the scale.
So wrapping up this deep dive, we've covered a lot. The goal was really to show what it's like to be a data scientist and also how some of this work gets done.
From defining the field to specific algorithms and those tough real-world challenges.
Right, and revisiting that core question, what is data science? The book suggests it's maybe a set of best practices used in tech companies tackling problems with data, sometimes scientifically.
But also, always be wary of the hype. It's not magic, it's work.
It's a process, absolutely. And thinking about the future, the hope for next-gen data scientists, it's not just about technical skill or salary.
No, it's about being good problem solvers, asking the right questions, thinking deeply about design, process, ethics.
Like Jeff Hammerbacher's famous quote about the best minds thinking about how to make people click ads: "That sucks." The aspiration is to use these powerful tools responsibly, to make things better, not worse.
Which comes back to critical thinking, always remembering that data does not speak for itself. It needs interpretation, criticism, evaluation. It requires dealing with messy, incomplete, often inconclusive data. It's a human process full of judgment calls.
So for you, our listener, as you maybe explore data science yourself, what does all this mean? It's complex, it's evolving, it has huge potential and pitfalls. You've heard the practicalities, the ethical tightropes, the human effort involved.
It's not just about the algorithms.
Definitely not. So here's something to think about as you continue your journey. What ethical questions will you inevitably face when you start applying these powerful tools? How will you handle that tension between what data can show you and what should be done with that knowledge, especially when different people have different interests? Ultimately, how will you try to ensure that your work, your insights, are used to make the world genuinely better and not just, you know, to make people click?
