Practical Web Scraping for Data Science: Best Practices and Examples with Python

Speaker 1

00:00

Welcome to the Deep Dive. We're the show that helps you cut through the noise, taking stacks of sources and finding those key insights so you can get genuinely well informed, fast. Today we're diving into something that feels, well, part magic, part engineering, maybe even a little bit detective work.

Speaker 2

00:16

It's web scraping, right, the ability to basically write a little program that goes out onto the Internet and gathers data all by itself.

Speaker 1

00:23

Yeah, and seeing it work the first time, there's this real rush, like you've unlocked some secret level of the web or something.

Speaker 2

00:29

Definitely.

Speaker 1

00:29

Our main guide today is the book Practical Web Scraping for Data Science by Seppe vanden Broucke and Bart Baesens. Really comprehensive stuff.

Speaker 2

00:37

It is. It covers a lot of ground.

Speaker 1

00:39

So our mission today: to really get you up to speed on what web scraping is, why it's so important for data science, and crucially, the things you absolutely need to think about technically and, maybe even more importantly, ethically.

Speaker 2

00:52

Yeah, the how, but also the should you.

Speaker 1

00:54

Exactly. Get ready for some aha moments, because we're going to unpack how you can get a solid handle on this pretty powerful skill. Okay, so let's start at the beginning. What actually happens under the hood when you just type, say, www.google.com into your browser? Most of us just hit Enter, right?

Speaker 2

01:12

Right, and we take it for granted. But there's this incredible coordination happening invisibly, like you said, under the hood. Before you even see anything, all these protocols are firing off. DNS is translating that name into an IP address.

Speaker 1

01:25

The computer's actual address.

Speaker 2

01:27

Exactly. Then TCP makes sure the data gets there reliably. But the layer we really care about for scraping, the actual language the web speaks...

Speaker 1

01:35

That's HTTP, Hypertext Transfer Protocol.

Speaker 2

01:38

That's the one. It's basically a plaintext conversation: the browser sends a request, and the server sends a response with the web page content. Understanding that back and forth is fundamental.
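
To make that plaintext conversation concrete, here is a minimal sketch of what one request and one response look like as text. It assumes HTTP/1.1; the function names, the `example-scraper/0.1` client name, and the canned response are illustrative and not from the episode. Nothing is sent over the network.

```python
def build_get_request(host: str, path: str = "/") -> bytes:
    """Return the raw bytes of a minimal HTTP/1.1 GET request."""
    lines = [
        f"GET {path} HTTP/1.1",             # request line: method, path, version
        f"Host: {host}",                    # Host is mandatory in HTTP/1.1
        "User-Agent: example-scraper/0.1",  # hypothetical client name
        "Connection: close",
        "",                                 # an empty line ends the headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_response(raw: bytes):
    """Split a raw HTTP response into (status_code, headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    status_code = int(status_line.split(" ", 2)[1])
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_code, headers, body


# A canned response standing in for what a server might send back.
sample_response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Length: 28\r\n"
    b"\r\n"
    b"<html><body>Hi</body></html>"
)

print(build_get_request("example.com").decode("ascii"))
print(parse_response(sample_response))
```

In practice a library such as requests builds and parses these messages for you; the sketch only shows that both sides are readable text.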

Speaker 1

01:47

Okay, so HTTP is the conversation. How do we get our program to join that conversation. How do we make those requests?

Speaker 2

01:55

Well, that's where Python's requests library is just fantastic. You can use Python's built-in stuff.

Speaker 1

02:00

urllib, but requests is nicer?

Speaker 2

02:02

Oh, much nicer, way more user friendly. Think of it like a really efficient messenger. You just tell it, hey, go get this page using requests.get, or send this data with requests.post. It handles a lot of the fiddly bits for you automatically, like setting standard headers such as User-Agent, which tells the server what kind of browser you are, or are pretending to be.

Speaker 1

02:22

Ah, so you can look like a normal browser?

Speaker 2

02:25

Pretty much. And crucially, it also lets you change those headers if you need to. Sometimes servers are a bit picky about who they talk to, so that flexibility is key.
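
As a rough illustration of the header handling described here, the sketch below prepares (but does not send) a GET request with the requests library, once with the library's default User-Agent and once with a custom one. The URL, the helper name `build_request`, and the bot name with its contact address are made-up placeholders.

```python
import requests


def build_request(url, user_agent=None):
    """Prepare a GET request, optionally overriding the User-Agent header."""
    session = requests.Session()
    if user_agent is not None:
        # Session headers are merged into every request the session prepares.
        session.headers["User-Agent"] = user_agent
    request = requests.Request("GET", url)
    return session.prepare_request(request)


default = build_request("https://example.com/")
custom = build_request(
    "https://example.com/",
    user_agent="my-research-bot/0.1 (contact@example.com)",  # hypothetical
)

# By default requests identifies itself as python-requests/<version>.
print(default.headers["User-Agent"])
print(custom.headers["User-Agent"])

# To actually send a prepared request you could call, for example:
#   response = requests.Session().send(custom, timeout=10)
# or skip the preparation step entirely:
#   response = requests.get(url, headers={"User-Agent": "..."}, timeout=10)
```

Preparing the request without sending it is just a convenient way to inspect the headers offline; for everyday use `requests.get(url, headers=...)` is the shorter route.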

Speaker 1

02:34

Right, that makes sense. So requests fetches the page content. But then you've got this big blob of, well, usually HTML, right? And looking at raw HTML can be pretty intimidating, all those angle brackets.

Speaker 2

02:45

Oh yeah, it looks like tag soup sometimes, just a jumble.

Speaker 1

02:49

So how do we find the actual data we want inside that jumble?

Speaker 2

02:52

That's the next piece of the puzzle. HTML, Hypertext Markup Language. It looks messy, but it actually has structure. It uses tags, like <a> for a link or <div> for a section. Then CSS styles it. To navigate that structure and pull out specific bits, we use another great library, Beautiful Soup. It takes that messy HTML string and turns it into this navigable Python object, like a tree structure you can walk through.

Speaker 1

03:18

Ah, like a family tree for the web page elements.

Speaker 2

03:20

Kind of, yeah. And then you can easily say, find me all the <a> tags, or find the <div> with this specific ID, or even use these really powerful CSS selectors to pinpoint exactly the element you need based on its class, its attributes, or its position in the tree.

Speaker 1

03:33

And just building on that, for you listening, your browser's developer tools are like your secret weapon here. Seriously invaluable, I cannot stress that enough. You hit F12 usually, and the Elements tab shows you that nice structured tree view of the HTML that Beautiful Soup will see. You can hover over stuff on the page and see the code light up.

Speaker 2

03:52

Yeah, it's brilliant for figuring out what tags or what CSS selectors you need to target. You can often just right-click an element and copy its selector directly.
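
The three lookups mentioned above (all links, an element by ID, a CSS selector) might look roughly like this with Beautiful Soup. The HTML snippet, the `products` ID, and the class names are invented for the sketch; a real page would need selectors found with the developer tools.

```python
from bs4 import BeautifulSoup

# A small, made-up page standing in for HTML fetched from a server.
html = """
<html>
  <body>
    <div id="products">
      <div class="product"><a href="/item/1">Widget</a> <span class="price">9.99</span></div>
      <div class="product"><a href="/item/2">Gadget</a> <span class="price">24.50</span></div>
    </div>
    <a href="/about">About</a>
  </body>
</html>
"""

# "html.parser" is Python's built-in parser, so no extra install is needed.
soup = BeautifulSoup(html, "html.parser")

# Find every <a> tag and read its href attribute.
all_links = [a["href"] for a in soup.find_all("a")]

# Find the <div> with a specific ID, then search only inside it.
products_div = soup.find("div", id="products")
names = [a.get_text(strip=True) for a in products_div.select("a")]

# Use a CSS selector: <span class="price"> directly inside <div class="product">.
prices = [float(span.get_text()) for span in soup.select("div.product > span.price")]

print(all_links)
print(names)
print(prices)
```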

Speaker 1

04:00

Just one quick tip though: remember, View Source shows the raw HTML the server sent. The Elements tab shows what the browser has processed, which might include changes made by JavaScript after the page loaded.

Speaker 2

04:13

That's a really key distinction. Yeah, what you see in elements is often closer to what you need if the page is dynamic.

Speaker 1

04:18

Okay, perfect segue. That covers static pages really well. But what about those more complex sites, the ones that are heavy on JavaScript, where content loads dynamically as you scroll, or maybe they set cookies using JavaScript? Our requests and Beautiful Soup approach might just stop working there, because they aren't actually running a browser. They're just fetching the initial HTML source.

Speaker 2

04:40

You hit the nail on the head. That's a huge challenge with modern web development. So many sites are JavaScript heavy. The initial HTML might be almost empty, just a shell. The actual content gets fetched and rendered by JavaScript running in your browser, and sometimes that JavaScript is deliberately obfuscated, made hard to read, to make reverse engineering it almost impossible.

Speaker 1

05:02

So you can't easily figure out where it's getting the data from.

Speaker 2

05:06

Exactly. Or maybe it sets a special cookie, like a nonce, a security token, using JavaScript, and without that cookie you can't make further requests. So if requests can't run JavaScript, what do you do?

Speaker 1

05:18

And that's where, I guess, Selenium comes into the picture. It's more than just a scraper, isn't it? It's about browser automation.

Speaker 2

05:25

Precisely. Selenium's original purpose was actually automated testing of websites: making sure buttons work, forms submit, et cetera. But that makes it incredibly powerful for scraping, because it literally drives a real web browser: Chrome, Firefox, whatever you configure.

Speaker 1

05:37

So it can run the JavaScript.

Speaker 2

05:39

Yes, it loads the page, waits for things to appear, clicks buttons, fills in forms, scrolls down the page. Anything a human user can do, Selenium can automate.

Speaker 1

05:50

Ah, okay. So for those sites where content loads as you scroll, like maybe infinite scrolling on social media or news sites?

Speaker 2

05:57

Perfect example. requests would only get the first batch. Selenium can actually perform the scroll action, wait for the new content to load because the JavaScript runs, and then grab it.

Speaker 1

06:07

That's clever. What about waiting? Pages don't always load instantly.

Speaker 2

06:10

Right. Selenium has tools for that too. You can use waits, telling your script, hey, wait until this specific button is clickable, or wait until this piece of text appears, before you try to interact with it. It makes your scraper much more robust against slow-loading pages or dynamic elements.
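
One possible shape for the scroll-and-wait pattern described above is sketched here. It assumes Selenium 4 with Chrome available; the helper names, the 10-second timeout, and the idea of stopping once the page height stops growing are choices made for this sketch, not something prescribed in the episode.

```python
import time


def scroll_until_stable(driver, pause_seconds=1.0, max_rounds=20):
    """Scroll to the bottom repeatedly until the page height stops growing.

    Returns the number of scroll rounds performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause_seconds)  # give the page's JavaScript time to load more
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so we assume we reached the end
        last_height = new_height
    return rounds


def scrape_dynamic_page(url, selector):
    """Open a page in Chrome, wait for `selector`, scroll, return element texts."""
    # Imported inside the function so the helper above works without Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Explicit wait: block for up to 10 seconds until the element exists.
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, selector))
        )
        scroll_until_stable(driver)
        elements = driver.find_elements(By.CSS_SELECTOR, selector)
        return [element.text for element in elements]
    finally:
        driver.quit()  # always close the browser, even if something failed
```

A call such as `scrape_dynamic_page("https://example.com/feed", "div.post")` (both arguments hypothetical) would launch a real browser, which is why this approach is heavier than a plain HTTP request.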

Speaker 1

06:26

That sounds incredibly capable, but I imagine driving a full browser isn't as lightweight as just making a simple HTTP request. Is there a downside?

Speaker 2

06:34

Absolutely. There's a trade-off. Selenium is significantly slower and uses way more memory and CPU than requests and Beautiful Soup.

Speaker 1

06:42

Because it's literally running Chrome in the background or something.

Speaker 2

06:44

Exactly. You're paying for that full browser emulation. So it's powerful, essential for those tricky dynamic sites. But you always want to check first: can I get this data with the simpler, faster requests approach? Use Selenium when you have to.

Speaker 1

07:00

Okay, makes sense. Choose the right tool for the job. So let's say we've figured out how to scrape one page, maybe even a dynamic one with Selenium. How do we scale that up? How do we go from scraping a page to, well, crawling hundreds or thousands across a whole website? That feels like a different beast.

Speaker 2

07:18

It is. And that distinction between scraping, grabbing data from a specific page, and crawling, navigating link by link to discover and scrape many pages, is really important.

Speaker 1

07:28

Like what search engines do, but on a smaller scale, maybe.

Speaker 2

07:31

Exactly, they crawl the web constantly. For data science, if you need to crawl a site, you need a more structured approach. Best practices become vital. You'll almost certainly want a database, something simple like SQLite is often fine, maybe using a helper library like records, to keep track of everything. What kind of thing? Well, you need a list of URLs you plan to visit, the crawl frontier.

07:50

You need a list of URLs you've already visited so you don't get stuck in loops or scrape the same page multiple times. And of course you need to store the data you extract. It's also really good practice to separate the logic. Have one part of your code responsible for finding new links, the crawler, and another part responsible for extracting data from a page, the scraper. It makes it easier to manage.

Speaker 1

08:13

And you have to be careful not to hammer the website.

Speaker 2

08:16

Absolutely critical. You need to build in delays or cool-down periods between your requests. Don't just fire them off as fast as possible. You also need error handling. What if a page is temporarily down? You need logic to retry later. And thinking about doing things in parallel can speed it up, but then you have to be even more careful not to overload the server. It's a balancing act.
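
The pieces listed above (a frontier, a visited list, stored data, a separate crawler and scraper, delays, and retries) could be wired together along these lines. This is a minimal single-threaded sketch using Python's built-in sqlite3 module rather than the records library; the table layout and every function name are assumptions made for illustration. The `fetch`, `find_links`, and `scrape` callables are supplied by the caller, which keeps link discovery and data extraction separate.

```python
import sqlite3
import time


def init_db(conn):
    """One table serves as frontier (visited = 0) and visited list (visited = 1)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        " url TEXT PRIMARY KEY,"
        " visited INTEGER NOT NULL DEFAULT 0,"
        " data TEXT)"
    )


def add_to_frontier(conn, url):
    # INSERT OR IGNORE skips URLs we already know about, visited or not.
    conn.execute("INSERT OR IGNORE INTO pages (url) VALUES (?)", (url,))


def next_unvisited(conn):
    row = conn.execute(
        "SELECT url FROM pages WHERE visited = 0 ORDER BY rowid LIMIT 1"
    ).fetchone()
    return row[0] if row else None


def crawl(conn, start_url, fetch, find_links, scrape,
          delay=1.0, retries=2, max_pages=100):
    """Crawl from start_url; returns the number of pages processed."""
    init_db(conn)
    add_to_frontier(conn, start_url)
    pages_done = 0
    while pages_done < max_pages:
        url = next_unvisited(conn)
        if url is None:
            break  # frontier is empty
        page = None
        for attempt in range(retries + 1):
            try:
                page = fetch(url)
                break
            except Exception:  # broad on purpose in this sketch
                time.sleep(delay * (attempt + 1))  # back off before retrying
        data = None
        if page is not None:
            for link in find_links(url, page):  # the "crawler" part
                add_to_frontier(conn, link)
            data = scrape(page)                 # the "scraper" part
        # Pages that never loaded are marked visited with no data; a fuller
        # version might re-queue them for a later run instead.
        conn.execute(
            "UPDATE pages SET visited = 1, data = ? WHERE url = ?", (data, url)
        )
        conn.commit()
        pages_done += 1
        time.sleep(delay)  # cool-down so we do not hammer the server
    return pages_done
```

With real pages, `fetch` might wrap `requests.get(url, timeout=10).text` and `find_links` might use Beautiful Soup plus `urljoin`; using a file-backed database instead of `:memory:` would let an interrupted crawl resume where it stopped.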

Speaker 1

08:35

And you mentioned some specific tools for handling URLs.

Speaker 2

08:39

Yeah, little things become important when crawling, like urllib.parse.urljoin. Websites often use relative links like about-us. Your crawler needs to correctly combine that with the base URL to get the full address. urljoin handles that reliably, and urldefrag helps remove those fragment identifiers, the bit after the hash sign, so you don't accidentally crawl page.html#section1 and page.html#section2 as if they were different pages.
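
A short sketch of those two standard-library helpers follows. The example.com URLs and the `normalize` helper are illustrative; `urljoin` and `urldefrag` themselves are part of Python's urllib.parse module.

```python
from urllib.parse import urldefrag, urljoin

base = "https://example.com/company/index.html"

# A relative link is resolved against the page it was found on.
about = urljoin(base, "about-us")
# A link starting with "/" is resolved against the site root.
contact = urljoin(base, "/contact")


def normalize(base_url, href):
    """Resolve href against base_url and drop any #fragment."""
    absolute = urljoin(base_url, href)
    url, _fragment = urldefrag(absolute)
    return url


first = normalize(base, "page.html#section1")
second = normalize(base, "page.html#section2")

print(about)
print(contact)
print(first == second)  # both fragments point at the same page
```

Normalizing every discovered link this way before adding it to the frontier keeps the visited list free of duplicates that differ only by fragment.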

Speaker 1

09:05

So why is this scaling up, this crawling capability so important for you our listeners doing data science? What doors does it open?

Speaker 2

09:13

It opens the door to data sets that just don't exist anywhere else or aren't available in a neat packaged format. The web is this enormous, constantly updated, incredibly rich source of, well, mostly unstructured data.

Speaker 1

09:25

A real treasure trove, if you can access it.

Speaker 2

09:28

Exactly. Imagine you want to build a sentiment analysis model for product reviews. You'd need thousands, tens of thousands of reviews. Where do you get them? You crawl e-commerce sites.

Speaker 1

09:35

Well, maybe tracking housing prices?

Speaker 2

09:37

Perfect example. Collect real estate listings across a whole city or region for analysis or visualization. We've seen amazing projects born from this. Google Translate got massively better by using scraped text from across the web. There was the Billion Prices Project at MIT, which scraped online retailers daily to create near real-time inflation tracking, way faster than official government stats.

Speaker 1

09:59

Wow.

Speaker 2

10:00

Yeah. Or think about monitoring social media for mentions of Bitcoin to gauge public sentiment, or analyzing job postings to see which data science skills are currently in demand. All of these rely on robust crawling. It's about turning the messy, sprawling web into structured, valuable information for your data science pipeline.

Speaker 1

10:19

Okay, so let's pull back a bit. Thinking about that data science pipeline, maybe using a framework like CRISP-DM, where does web scraping fit into the bigger picture?

Speaker 2

10:27

Good question. It primarily slots into the early phases, data understanding and data preparation.

Speaker 1

10:32

Finding and getting the data.

Speaker 2

10:34

Right. Specifically, it's often part of identifying data sources, realizing the web is a potential source, and then selecting the data and actually collecting it. It's usually about enriching data sets you already have, or maybe creating a totally new data set from scratch using web data.

Speaker 1

10:51

But it's not just a technical task, is it? You mentioned managerial concerns.

Speaker 2

10:55

Yes, and this is often underestimated. There's this crucial gap between building a model using scraped data, the model train phase, and actually deploying that model where it needs ongoing scraped data to work, the model run phase.

Speaker 1

11:09

Ah, because the website might change.

Speaker 2

11:11

Exactly. Websites change all the time. Layouts change, HTML structure changes, login processes change. A scraper that works perfectly today might break tomorrow.

Speaker 1

11:20

Without warning. So your production model suddenly stops working because its data feed broke.

Speaker 2

11:24

Precisely, which means web scrapers require ongoing maintenance. Someone has to monitor them and fix them when they break. That's a real cost, and that's why the golden rule, the first piece of advice, is always: look for an official API first.

Speaker 1

11:36

Application programming interface, a structured way for programs to get data.

Speaker 2

11:42

Right. If the website offers an API, and it provides the data you need, and the terms are acceptable, maybe it's free or reasonably priced, use the API. It's almost always going to be more stable, more reliable, and less likely to break than a custom scraper you build yourself.

Speaker 1

12:00

Really solid advice. But what if there isn't an API, or maybe the API exists but it's, I don't know, super limited in how many requests you can make, or it just doesn't have that one specific piece of data you absolutely need? When does building the scraper become worth the hassle?

Speaker 2

12:15

That's the judgment call, isn't it. It's a trade off. If the API doesn't cut it for whatever reason, cost rate limits, missing data fields, then yeah, building and maintaining a scraper might be your only option.

Speaker 1

12:26

So you weigh the development and maintenance effort against the value of the data.

Speaker 2

12:31

Exactly. But you go into it with your eyes open, knowing it's likely going to require ongoing work. It's that cat-and-mouse game people talk about. Websites might actively try to block scrapers, so you might need to adapt your techniques constantly.

Speaker 1

12:43

And speaking of blocking and, well, potential conflicts: the legal side of this. You mentioned it's complex. It sounds like it's not just a technical decision, but a legal and ethical one too.

Speaker 2

12:53

Absolutely, it's murky waters, legally speaking. There isn't one single law that says web scraping is legal or web scraping is illegal. It depends. Several legal arguments tend to pop up in court cases, at least in the US. Like what? Well, there's breach of terms and conditions. If a website's terms of service explicitly forbid scraping and you clicked "I accept" somewhere, they might have a case. We saw that with Ryanair winning against a flight data scraper.

Speaker 1

13:21

Okay, so read the terms.

Speaker 2

13:23

Definitely. Then there's copyright infringement. Is the data itself copyrighted? Usually facts aren't, but the presentation might be. The fair use doctrine often gets debated here. Think about Google Books scanning millions of books, lots of legal wrangling there. There's also the CFAA, the Computer Fraud and Abuse Act. It's meant to target hacking, unauthorized access. Sometimes companies try to argue that scraping constitutes unauthorized access, especially if you bypass technical barriers.

Speaker 1

13:48

Hmm, that seems like a stretch for public data.

Speaker 2

13:52

Courts have struggled with it. There are also older concepts like trespass to chattels, basically arguing your scraper is interfering with their server resources, especially if you overload it. And then there's the robots.txt file.

Speaker 1

14:05

Right, the file that tells bots where they shouldn't go.

Speaker 2

14:07

Yeah, it's not strictly legally binding in most cases, but ignoring it is definitely not playing nice and signals you're disregarding the site owner's wishes. It could be used as evidence against you.

Speaker 1

14:19

You mentioned a specific case earlier, hiQ Labs versus LinkedIn, that seemed pretty important for this whole public data question.

Speaker 2

14:26

It was, yeah, a really significant case. hiQ Labs was scraping data from public LinkedIn profiles. LinkedIn tried to stop them technologically and legally, invoking the CFAA.

Speaker 1

14:37

And what did the courts say?

Speaker 2

14:38

The courts, particularly the Ninth Circuit, basically ruled that scraping publicly accessible data, even if the site tries to block you with technical measures or says not to in its terms, doesn't necessarily violate the CFAA's "without authorization" clause. The key was that the data was already open to the public.

Speaker 1

14:54

So if it's public, maybe it's fair game?

Speaker 2

14:57

It leans that way, but it's not a blanket permission slip. It highlighted just how blurry the lines are around public information and unauthorized access when it comes to the web. The legal landscape is definitely still evolving.

Speaker 1

15:11

Okay, so with all that complexity, technical, ethical, legal, what's the takeaway for you, our listener? What are your core responsibilities when you decide to scrape data? What's the baseline for being a good digital citizen?

Speaker 2

15:24

The absolute number one rule is play nice, be respectful, don't bombard a website with requests. Think about the impact you're having.

Speaker 1

15:33

Don't be the reason their site goes down.

Speaker 2

15:35

Exactly. That can cause real financial damage, and that's when legal action becomes much more likely. We saw QVC sue a company called Resultly, claiming excessive scraping caused outages costing millions. You don't want to be that person. So throttle your requests and put delays in. Identify yourself with a proper User-Agent header if you can, maybe even include contact info,

15:55

if appropriate. Yes, always, always check the robots.txt file and the terms of service first. See what the site owner explicitly asks for or forbids. And if you can, the absolute safest route is to get written permission from the website owner.
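
Checking robots.txt can be automated with Python's built-in urllib.robotparser. The sketch below parses a made-up robots.txt from a string so it runs offline; the rules, the `my-research-bot` name, and the example.com URLs are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt; a real one lives at https://<site>/robots.txt.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

USER_AGENT = "my-research-bot"  # hypothetical name for our scraper

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())


def allowed(url):
    """Return True if robots.txt permits our user agent to fetch url."""
    return parser.can_fetch(USER_AGENT, url)


print(allowed("https://example.com/index.html"))
print(allowed("https://example.com/private/data.html"))
# Crawl-delay is a non-standard directive; None means the site set no value.
print(parser.crawl_delay(USER_AGENT))

# For a live site you would load the file over the network instead:
#   parser.set_url("https://example.com/robots.txt")
#   parser.read()
```

Calling `allowed(url)` before each request, and sleeping for at least the requested crawl delay between requests, is one simple way to respect the site owner's stated wishes. It does not replace reading the terms of service.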

Speaker 1

16:10

That might not always be practical, but it's the ideal.

Speaker 2

16:13

It's the gold standard, yeah. And just pause and think: is this data truly intended to be public and consumed in this way? Or am I accessing something private or trying to circumvent a system? Using common sense and acting ethically is just as important as writing clever code.

Speaker 1

16:29

Wow, okay, that was definitely a deep dive. We've gone from the basics of HTTP...

Speaker 2

16:34

To parsing HTML with Beautiful Soup.

Speaker 1

16:36

Tackling tricky JavaScript with Selenium.

Speaker 2

16:38

Scaling up with crawling techniques.

Speaker 1

16:40

And wrestling with those really crucial management and legal questions. I think you listening should now have a much clearer picture not just of the how of web scraping, but the really important why and when, and when not to.

Speaker 2

16:53

Yeah, it's about using this powerful tool effectively but also responsibly.

Speaker 1

16:57

Absolutely. So here's a final thought to leave you with, something to chew on. We've talked about this cat-and-mouse game between scrapers and websites and the shifting legal sands. Considering how fast things like AI are evolving, and maybe new techniques for hiding or protecting data online, how might our very definition of publicly available data change in the next few years?

Speaker 2

17:21

That's a big question, right.

Speaker 1

17:23

And what could that changing definition mean for how we gather data, how we do analysis, and maybe even the kinds of innovation that are possible across, well, pretty much every field?

Speaker 2

17:31

Yeah, what does public mean when data is generated by AI or locked behind complex interactions? Lots to think about there.

Speaker 1

17:38

Definitely something to ponder. We'll leave it there for this Deep Dive.

Transcript source: Provided by creator in RSS feed