Okay, welcome to the deep dive. Today we're jumping into web scraping.
That's right, specifically using PHP, based on some key ideas from our source, Instant PHP Web Scraping.
Yeah, it's from 2013, by Jacob Ward. So things have moved on, obviously, but the core ideas are often still relevant, aren't they?
They really are. The book was aimed at beginners, you know, showing how to programmatically crawl websites, download content, and basically turn unstructured web stuff into structured data using PHP.
So our mission here is to pull out those fundamental techniques from these excerpts and give you a solid grounding in how PHP web scraping works at its core.
Even if you're maybe adapting these ideas for more modern sites later on, the basics often carry through.
Exactly. The source assumes maybe not a ton of programming experience, though knowing some PHP and HTML definitely helps.
Sure, but the focus is really on the scraping concepts themselves.
All right, let's kick things off. Before you scrape, you need your tools. What's the basic toolkit according to these sources?
Okay, so first, obviously, you need PHP itself. That's the language, right? Then a good place to write your code, an IDE, an integrated development environment. The source mentions Eclipse PDT.
PDT being the PHP Development Tools for Eclipse, so a specialized code editor.
Yeah, basically. It makes coding easier, keeps things organized. And then you need a way to run the PHP.
And probably a database too, like a local server setup?
Exactly. The source recommends XAMPP. It bundles Apache, which is the web server, PHP, and MySQL, the database, all in one package.
Ah, convenient. Avoids installing everything separately.
Yeah, and it even includes phpMyAdmin.
Oh, right, for managing the MySQL database visually. Useful later, definitely. Okay, so you install XAMPP, maybe Eclipse. Any specific setup tweaks needed?
A couple of key things the source points out. Setting your PHP path variable is good practice. It lets you run PHP scripts easily from the command line for testing and stuff. But the really critical one for scraping is enabling the cURL extension.
cURL. Okay, what is that, exactly?
It's a PHP library. You need it enabled in your main PHP config file, php.ini. Without it, your PHP script can't really make web requests easily.
Ah, so it's essential for fetching pages programmatically.
Absolutely. And then, you know, just test the setup. Make sure Apache runs. Maybe run a simple phpinfo script to see if cURL is listed as enabled.
Got it, so toolkit ready, cURL enabled. Now the first actual step in scraping: getting the web page.
Fetching the content. Yeah, this is where cURL comes into play directly.
Because it handles HTTP requests.
Exactly. It lets your script act like a browser, essentially sending a request to a URL and getting back the HTML source code.
And the source provides a function example, curlGet. What's the basic flow there?
It's pretty logical. You initialize a cURL session, think of it as opening a connection channel. Then you set options for that session, tell it what you want to do.
Like the URL you want to fetch.
CURLOPT_URL is the main one, and critically CURLOPT_RETURNTRANSFER. You usually want that set to true. Why is that? So that cURL returns the page content as a string variable in your PHP script instead of just, like, printing it straight to the screen. You need it as a variable to work with it.
Right, makes sense. Any other key options?
Oh yeah, CURLOPT_FOLLOWLOCATION is super useful.
For redirects, like 301s?
Exactly. Websites often redirect you. This option tells cURL to automatically follow those redirects to the final page. Saves you a lot of hassle.
Nice. What about CURLOPT_USERAGENT? The source gives an example string.
Ah, the user agent. It's basically a string that identifies your client, your script, to the web server. Why send one? Well, partly politeness, partly necessity. Some servers block requests that don't have a user agent string that looks like it's from a normal web browser, so sending one makes your script look less like a basic bot.
Okay, helps avoid immediate blocks, potentially.
Yeah. Then there's CURLOPT_HTTPHEADER if you need to send custom headers, sometimes needed for specific sites, and CURLOPT_FAILONERROR.
What does that do?
It tells cURL to treat HTTP error codes, like 404 Not Found or 500 Server Error, as, well, actual errors.
Instead of just returning an empty page or an error page.
Right, it can be a simple way to detect if the request failed badly.
Okay. And the source mentions checking the HTTP response code itself, using curl_getinfo.
Why bother? Because knowing the code tells you exactly what happened. 200 OK means success, you have the page.
404 means it doesn't exist. Right, crucial info. A 403 Forbidden means you don't have permission.
Maybe you need to log in or your IP is blocked.
301 Moved Permanently, which CURLOPT_FOLLOWLOCATION handles, but good to know.
Yeah, checking the status code is fundamental for robust error handling. You know why something might have failed.
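To make that flow concrete, here's a minimal sketch of a curlGet-style helper that sets the options just discussed and reports the status code. It's an illustration, not the book's exact function: the array return shape, the placeholder user agent string, and the timeout values are all assumptions.

```php
<?php
// Sketch of a curlGet-style helper. The return shape (status/body/error) is an
// assumption made for this example, not the book's exact function.
function curlGet(string $url): array
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,  // return the page as a string instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects such as 301s
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; ExampleScraper/1.0)', // placeholder string
        CURLOPT_FAILONERROR    => true,  // make curl_exec() fail on HTTP codes >= 400
        CURLOPT_CONNECTTIMEOUT => 10,    // assumed values, tune to taste
        CURLOPT_TIMEOUT        => 30,
    ]);

    $body   = curl_exec($ch);
    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE); // 0 if no HTTP response arrived
    $error  = $body === false ? curl_error($ch) : null;

    return [
        'status' => $status,
        'body'   => $body === false ? null : $body,
        'error'  => $error,
    ];
}

// Usage (needs network access):
// $page = curlGet('http://example.com/');
// if ($page['status'] === 200) { /* $page['body'] holds the HTML */ }
```

Returning the status alongside the body means the calling code can tell a 404 from a 403 from a network failure, rather than just seeing an empty result.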
Okay, so cURL gets you the raw HTML, maybe a massive string of code. Now the real challenge: finding the specific bit of data you want inside all that.
Extraction time. And the main tool the source introduces here is XPath.
XPath. I've heard of it with XML. How does it apply to HTML?
Well, HTML isn't always perfect XML, but it's structured, right, with tags and attributes. You can parse that downloaded HTML string into something called a DOM, a Document Object Model.
A tree structure of the page.
Precisely. And XPath is a language specifically for navigating that tree and selecting nodes, elements, attributes, text, based on their path or characteristics.
So it's more structured than just, like, searching for keywords in the string?
Much more. The source shows a function, returnXPathObject. It basically takes the HTML string.
The one you got from cURL, right?
Right. It uses PHP's built-in DOMDocument class to load that HTML, even if it's a bit messy.
I see the source uses an @ symbol before loadHTML. Is that related to messy HTML?
It is. Real-world HTML often has minor errors. The @ symbol in PHP suppresses the warnings that loadHTML might generate because of that imperfect markup. It stops those minor issues from flooding your output or tripping up your script.
Ah, a practical trick for scraping. Okay, so DOMDocument loads the HTML, then what?
Then you create a DOMXPath object from that DOMDocument, and that XPath object is what you use to run your queries.
Okay, queries. The source has examples like //h1 or //span[@class="some-class"].
Exactly. Those are XPath expressions. //h1 means find any h1 element anywhere in the document.
And the span with @class?
That's more specific: find any span element that has an attribute named class with the exact value some-class.
What about /@href at the end of one example?
That's selecting an attribute. So maybe it found a specific link, an a tag, and adding /@href says get the value of its href attribute, the URL itself.
So you run these queries against the XPath object and it...
Gives you back a list of matching nodes, elements or attributes.
And then the source shows item(0)->nodeValue to get the actual text.
Right, the query might find multiple matches, so item(0) gets the first one in the list. Then nodeValue extracts the text content from inside that element.
Okay, so XPath is powerful for navigating that structure.
Very. The source has a table with common expressions, including using brackets for conditions. That's your vocabulary for building these queries.
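Here's a small sketch of that whole XPath step, from an HTML string to extracted values. The function follows the returnXPathObject idea described above, but its exact signature, the sample HTML, and the class names are assumptions made up for this example.

```php
<?php
// Sketch of a returnXPathObject-style helper: HTML string in, DOMXPath out.
function returnXPathObject(string $html): DOMXPath
{
    $dom = new DOMDocument();
    // @ suppresses the warnings that imperfect real-world markup triggers.
    @$dom->loadHTML($html);
    return new DOMXPath($dom);
}

// Made-up sample page, so the example runs without fetching anything.
$html = '<html><body><h1>Book List</h1>'
      . '<span class="some-class">First</span>'
      . '<a class="next" href="/page/2">Next</a></body></html>';

$xpath = returnXPathObject($html);

// query() returns a list of matches; item(0) is the first, nodeValue its text.
echo $xpath->query('//h1')->item(0)->nodeValue, PHP_EOL;                         // Book List
echo $xpath->query('//span[@class="some-class"]')->item(0)->nodeValue, PHP_EOL;  // First
echo $xpath->query('//a[@class="next"]/@href')->item(0)->nodeValue, PHP_EOL;     // /page/2
```

In real code it's worth checking that query() found at least one node (its length property is greater than zero) before calling item(0), since item(0) is null when nothing matches.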
What if the data isn't neatly inside a tag, or the structure's just chaotic? XPath might not work then.
Exactly. Sometimes XPath is overkill or just plain impossible. That's where, as the source shows, you might need a more direct approach: custom functions.
Like the scrapeBetween function mentioned?
Perfect example. It's much simpler conceptually. Its whole job is to find a chunk of text that sits between two other known, unique strings.
So you don't care about HTML tags, just find the text after the start marker and before the end marker.
Precisely. You give it the whole chunk of text, like the page source, the starting string, and the ending string.
How does it work?
It uses basic PHP string functions: stripos to find the positions of the start and end markers, then substr to cut out the piece of the string between those positions.
Simple but effective if the markers are reliable. The source uses scraping a Google Analytics ID as an example.
Yeah, that ID is often embedded in JavaScript between specific quote marks or function calls. XPath wouldn't easily grab that, but scrapeBetween works perfectly.
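A minimal sketch of that idea, using stripos and substr as described. The exact signature and the empty-string fallback are assumptions, and the tracking snippet and ID in the demo are invented placeholders.

```php
<?php
// Sketch of a scrapeBetween-style helper: return the text found between
// $start and $end inside $data, or '' if either marker is missing.
function scrapeBetween(string $data, string $start, string $end): string
{
    $from = stripos($data, $start);          // case-insensitive search for the start marker
    if ($from === false) {
        return '';
    }
    $from += strlen($start);                 // skip past the start marker itself
    $to = stripos($data, $end, $from);       // look for the end marker after that point
    if ($to === false) {
        return '';
    }
    return substr($data, $from, $to - $from);
}

// Made-up page fragment with a placeholder analytics ID.
$page = "<script>ga('create', 'UA-0000000-0', 'auto');</script>";
echo scrapeBetween($page, "ga('create', '", "'"), PHP_EOL; // UA-0000000-0
```

The approach is only as reliable as the markers: if the site changes the surrounding text, the function quietly returns an empty string, so it's worth checking for that.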
Okay, so we have text extraction covered with XPath and custom functions. What about non-text content, like images?
Good question. You often need to grab images too. The process combines things we've discussed. How so? First, you usually find the image's URL using XPath. You look for an img tag and grab its src attribute.
So like //img/@src?
Exactly. That gives you the URL of the image file.
Then you use cURL again?
Yep, you use your curlGet function or similar to download the content at that image URL.
But this time you're expecting image data, not HTML.
Right, binary data. And the source suggests a good practice: verify it actually is an image before saving it. How? PHP has a function, getimagesize. You can pass it the file path, if you save the download temporarily, or use getimagesizefromstring on the downloaded data directly. It'll return the image dimensions if it's valid, or false if it's not a recognized image type.
Smart. So you verify it, and then?
Then you just use standard PHP file functions: fopen to open a local file for writing, fwrite to write the image data you got from cURL into it, and fclose to close the file.
And you've saved the image locally.
You have. And that basic method, find the URL, download with cURL, save with file functions, works for other file types too, like PDFs or whatever.
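The verify-then-save step might look something like this sketch. The helper name saveImageData is hypothetical, and it checks the downloaded bytes in memory with getimagesizefromstring rather than going through a temporary file.

```php
<?php
// Hypothetical helper: save downloaded bytes to $path only if they look like an image.
function saveImageData(string $data, string $path): bool
{
    // getimagesizefromstring() returns an array of dimensions for a recognized
    // image type, or false otherwise; @ hides notices about malformed data.
    if ($data === '' || @getimagesizefromstring($data) === false) {
        return false;
    }
    $handle = fopen($path, 'wb'); // "b" keeps binary data intact on Windows
    if ($handle === false) {
        return false;
    }
    fwrite($handle, $data);
    fclose($handle);
    return true;
}

// Usage, assuming $imageData holds bytes downloaded with cURL:
// if (!saveImageData($imageData, __DIR__ . '/cover.jpg')) {
//     echo 'Not a valid image, skipped', PHP_EOL;
// }
```

For PDFs or other non-image files you'd drop the image check and keep just the fopen, fwrite, fclose part.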
Okay, fetching static stuff and images is one thing, but lots of data is behind logins or search forms. How do you interact with sites like that?
Yeah, this requires simulating form submissions. Forms often use the HTTP POST method to send data.
So you need to make POST requests with cURL?
Exactly. The source shows a curlPost function example for this.
What do you need to know to make that POST request work?
You have to inspect the HTML form on the actual web page first. Look for the form tag itself. You need its action attribute. That's the URL you send the POST request to.
Okay, the destination URL.
Yep. Then you need to find all the input elements inside that form, and select or textarea too, potentially. What about them? You need their name attributes. Those names become the keys in the data you send, and you need the value you want to send for each name.
So if there's an input named username, you send your username.
Right. And crucially, don't forget hidden input fields. They often contain important stuff like session tokens or form IDs that the server expects back. The source's login example mentions needing email and password, but also destination and format, which might be hidden fields.
Ah, got to check the page source carefully. What about logins specifically? Don't they involve cookies?
Absolutely vital. When you log in successfully, the server usually sends back cookies to track your session. For subsequent requests to restricted pages, you need to send those cookies back.
How does cURL handle that?
It has options for it. CURLOPT_COOKIEJAR tells cURL to save the cookies it receives into a specified file. CURLOPT_COOKIEFILE tells cURL to read cookies from a file and send them with the request.
So you log in, save the cookies, and then use those cookies for future requests to stay logged in.
That's the basic idea. It maintains your session state.
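Here's a rough sketch of a POST helper that keeps cookies between requests. The signature, the cookie file handling, and the responseContains helper are assumptions for illustration; the URL and field values in the usage comment are placeholders, with field names borrowed from the login example mentioned above.

```php
<?php
// Sketch of a curlPost-style helper; the signature is an assumption for this example.
function curlPost(string $url, array $fields, string $cookieFile): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query($fields), // name => value pairs from the form
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_COOKIEJAR      => $cookieFile, // save cookies the server sets
        CURLOPT_COOKIEFILE     => $cookieFile, // send previously saved cookies
        CURLOPT_CONNECTTIMEOUT => 10,          // assumed values
        CURLOPT_TIMEOUT        => 30,
    ]);
    $body = curl_exec($ch);
    unset($ch); // releasing the handle lets cURL write the cookie jar file
    return $body === false ? null : $body;
}

// Hypothetical check: does the response contain text only shown after success?
function responseContains(?string $html, string $marker): bool
{
    return $html !== null && stripos($html, $marker) !== false;
}

// Usage (placeholder URL and values; needs network access):
// $html = curlPost('http://example.com/login', [
//     'email'       => 'me@example.com',
//     'password'    => 'secret',
//     'destination' => '/account',  // hidden field copied from the form
//     'format'      => 'html',      // hidden field copied from the form
// ], __DIR__ . '/cookies.txt');
// $loggedIn = responseContains($html, 'Welcome back');
```

Later GET requests that set CURLOPT_COOKIEFILE to the same file will send the saved session cookies along.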
The source also mentions posting files, like simulating an upload.
Yeah. If a form has an input of type file, you can simulate uploading a file using CURLOPT_POSTFIELDS. You set the value for that field name to the path of your local file, but you prefix the path with an @ symbol.
cURL understands the @ means upload this file?
Correct. It handles reading the file content and sending it appropriately.
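One caveat if you try this today: the @ prefix dates from the PHP versions that were current when the book came out. It was deprecated in PHP 5.5 and no longer works in PHP 7 and later, where the CURLFile class does the same job. A minimal sketch, with a hypothetical helper name and field names:

```php
<?php
// Hypothetical helper: build POST fields for a form containing <input type="file">.
function buildUploadFields(string $fileField, string $localPath, array $otherFields = []): array
{
    // CURLFile is the modern replacement for the old '@' . $localPath string.
    $otherFields[$fileField] = new CURLFile($localPath);
    return $otherFields;
}

// Usage: pass the array itself (not http_build_query) to CURLOPT_POSTFIELDS,
// so cURL sends multipart/form-data and attaches the file.
// $fields = buildUploadFields('upload', '/path/to/report.pdf', ['title' => 'Report']);
// curl_setopt($ch, CURLOPT_POSTFIELDS, $fields);
```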
Okay, so you've sent the POST request, maybe logged in. How do you know if it actually worked?
The simplest check shown in the source is just to look for a specific piece of text in the HTML response that you know only appears after a successful submission.
Like "Login successful" or "Welcome back, user"?
Exactly. You get the response page source from cURL and search that string for your success message. If it's there, it probably worked.
Right. Okay, single pages, forms, but the real power comes from scraping lots of pages, like product listings or search results that span multiple pages. How do you handle that pagination?
Yeah, traversing multiple pages. You start by scraping the first page as usual.
Get the data, find the image, whatever.
Right. But while you're doing that, you also use XPath to look for the link to the next page.
Like in a next button or page number link?
Exactly. Common places are ul elements with class pagination or pager. You'd write an XPath query to find an a tag inside that, maybe specifically the one with the text Next, and grab its href attribute.
So you get the URL for page two.
Yep. Then you scrape page two, and on page two you look for the link to page three, and so on. You basically collect a list of all the page URLs you need to visit.
Make sure they're full URLs, not relative ones.
Good point. If they're relative links, like page2, you need to prepend the base website URL to make them absolute before fetching with cURL.
Then you just loop through your list of URLs, scraping each one.
Pretty much. Fetch page, extract data, fetch next page, extract data, repeat.
Now this sounds like it could hit the server pretty fast if you have hundreds of pages.
It absolutely can, and that brings up a really critical point the source emphasizes: politeness.
Right, don't be a nuisance.
Or worse, get yourself blocked. Hammering a server with rapid-fire requests is bad form and often triggers automated defenses.
So how do you be polite?
The simplest, most common way shown is to just pause between requests. Use PHP's sleep function. The source suggests sleep(rand(1, 3)).
Wait a random one to three seconds between fetching each page.
Yeah, it slows your script down, mimics human browsing speed a bit more, and drastically reduces the load on their server. It's essential for any non-trivial scraping.
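Putting pagination and politeness together, a crawl loop could be sketched roughly like this. All three function names are made up for the example, the XPath assumes a ul with class pagination and a link whose text is Next, and the URL resolver is deliberately simplified. The fetch step is passed in as a callable, so the loop can be tried without touching a real site.

```php
<?php
// Turn a possibly relative href into an absolute URL. Simplified: handles full
// URLs, root-relative paths and same-directory paths only (no ports or "../").
function absoluteUrl(string $href, string $pageUrl): string
{
    if (preg_match('#^https?://#i', $href)) {
        return $href;
    }
    $parts = parse_url($pageUrl);
    $root  = $parts['scheme'] . '://' . $parts['host'];
    if ($href !== '' && $href[0] === '/') {
        return $root . $href;
    }
    $dir = isset($parts['path']) ? preg_replace('#/[^/]*$#', '/', $parts['path']) : '/';
    return $root . $dir . $href;
}

// Find the "Next" link inside a pagination list, or null on the last page.
function nextPageUrl(string $html, string $pageUrl): ?string
{
    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query('//ul[contains(@class, "pagination")]//a[normalize-space(.)="Next"]/@href');
    return $nodes->length > 0 ? absoluteUrl($nodes->item(0)->nodeValue, $pageUrl) : null;
}

// Visit pages one after another, pausing politely between requests.
// $fetch is any callable that takes a URL and returns the HTML string (or null).
function crawlPages(string $startUrl, callable $fetch, int $maxPages = 50, int $minDelay = 1, int $maxDelay = 3): array
{
    $pages = [];
    $url   = $startUrl;
    while ($url !== null && !isset($pages[$url]) && count($pages) < $maxPages) {
        $html = $fetch($url);
        if (!is_string($html) || $html === '') {
            break; // fetch failed, stop crawling
        }
        $pages[$url] = $html;
        $url = nextPageUrl($html, $url);
        if ($url !== null && $maxDelay > 0) {
            sleep(rand($minDelay, $maxDelay)); // the sleep(rand(1, 3)) idea from the source
        }
    }
    return $pages;
}
```

The isset check and the page cap are there so a site whose Next link loops back on itself can't keep the script running forever.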
Okay, vital tip. So you've scraped politely across many pages, extracted tons of data. Where does it all go? Printing to screen is useless now.
Right, you need persistent storage. The obvious choice demonstrated is a database. Since XAMPP includes MySQL, that's the example used.
First up is setting up the database table.
Yep, you need to design your table structure. You define columns that match the data points you're scraping, like book title, author, release date, ISBN, etc.
And you can use phpMyAdmin for that?
Yes, it's a handy graphical tool for creating the database and tables, setting data types, all that stuff.
Okay, tables ready. How does the PHP script connect and insert the scraped data?
The source uses PDO, PHP Data Objects. It's a standard, flexible way in PHP to talk to databases, including MySQL. You establish a connection using your database name, username, and password.
Then insert the data?
For inserting lots of items, the best practice shown is using prepared statements.
Why is that better than just building insert strings?
Two main reasons. Security: it prevents SQL injection vulnerabilities. And often better performance when you're inserting many rows with the same structure.
How do they work?
You write the INSERT query once, but use placeholders, like question marks or named parameters, for the actual values. You prepare this query structure with the database. Then you loop through your array of scraped data items, like all the books you found. Inside the loop, for each book, you bind its specific title, author, etc. to the placeholders in the prepared statement, and then you execute it.
So the query structure is set once and then just the data changes for each execution?
Exactly. It's cleaner and safer. One row gets inserted into your database table for each item in your scraped data array.
And the source also shows getting data back out.
Yeah, that completes the picture: using a SELECT query, again via PDO, to fetch the data you saved, perhaps looping through the results to display them in an HTML table on a web page.
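As a sketch of that storage step, here are two small functions built on a prepared statement and a SELECT. The function names, the books table, and its column names are assumptions for illustration, as are the connection details in the usage comment.

```php
<?php
// Insert each scraped book with one prepared statement; returns rows inserted.
// Assumes a table: books(title, author, release_date, isbn).
function saveBooks(PDO $pdo, array $books): int
{
    // The query structure is prepared once, with named placeholders.
    $stmt = $pdo->prepare(
        'INSERT INTO books (title, author, release_date, isbn)
         VALUES (:title, :author, :release_date, :isbn)'
    );
    $count = 0;
    foreach ($books as $book) {
        // Only the bound values change on each execution.
        $stmt->execute([
            ':title'        => $book['title'],
            ':author'       => $book['author'],
            ':release_date' => $book['release_date'],
            ':isbn'         => $book['isbn'],
        ]);
        $count++;
    }
    return $count;
}

// Read the saved rows back as associative arrays, ready to loop over.
function fetchBooks(PDO $pdo): array
{
    return $pdo->query('SELECT title, author, release_date, isbn FROM books ORDER BY title')
               ->fetchAll(PDO::FETCH_ASSOC);
}

// Usage with MySQL (placeholder database name and credentials):
// $pdo = new PDO('mysql:host=localhost;dbname=scraping;charset=utf8mb4', 'username', 'password',
//                [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);
// saveBooks($pdo, $scrapedBooks);
// foreach (fetchBooks($pdo) as $row) { echo htmlspecialchars($row['title']), PHP_EOL; }
```

Passing the values as an array to execute is one way to do the binding; calling bindValue for each placeholder before execute works just as well.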
Okay, this is getting quite sophisticated: fetching, parsing, interacting, storing. As scripts get bigger, the code could get messy, right? Repeating the same cURL setup or XPath creation.
It definitely can. That's where the source introduces object-oriented programming, or OOP, principles as a way to organize things better.
Making the code reusable and tidier.
That's the goal. The core idea is creating a class, which is like a blueprint for an object. It bundles together related data, the properties, and functions that operate on that data, the methods.
The book uses a human class analogy.
Yeah, like a Human blueprint might define properties like name and age, and methods like speak or walk. An object is a specific instance created from that blueprint, like $bob = new Human(), where Bob has his own specific name and age.
So how does the example Scrape class in the source apply this?
It takes the common scraping tasks, fetching a URL, creating the XPath object, and puts them inside the class definition as methods. And it uses a special method called __construct. What's that do? The constructor runs automatically whenever you create a new object from the class. So in the Scrape class example, when you write $pageScraper = new Scrape('http://example.com'), the __construct method immediately takes that URL.
The one you just passed in?
Right. It calls an internal method, maybe curlGet, to fetch the source code for that specific URL, and calls another internal method, like returnXPathObject, to create the XPath object for that source. It does the initial setup work.
Ah, so the object is immediately ready with the source and the XPath tool for its specific URL.
Precisely. The $pageScraper object now holds its own source code property and XPath object property, ready for you to use. You'd access them with something like $pageScraper->xPathObject->query().
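A stripped-down sketch of such a class might look like the following. The property names and method bodies are assumptions rather than the book's code, and the optional second constructor argument is an addition for this example, so the class can be tried with an HTML string instead of a live request.

```php
<?php
// Sketch of a Scrape-style class: the constructor fetches and parses a page.
class Scrape
{
    public $source;      // raw HTML of the page
    public $xPathObject; // DOMXPath ready for queries

    // $source is a hypothetical extra parameter: pass HTML to skip the download.
    public function __construct(string $url, ?string $source = null)
    {
        $this->source      = $source ?? $this->curlGet($url);
        $this->xPathObject = $this->returnXPathObject($this->source);
    }

    // Download a page; returns '' if the request fails.
    private function curlGet(string $url): string
    {
        $ch = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_TIMEOUT        => 30,
        ]);
        $body = curl_exec($ch);
        return $body === false ? '' : $body;
    }

    // Parse HTML into a DOM and wrap it in a DOMXPath object.
    private function returnXPathObject(string $html): DOMXPath
    {
        $dom = new DOMDocument();
        if ($html !== '') {
            @$dom->loadHTML($html); // @ hides warnings from imperfect markup
        }
        return new DOMXPath($dom);
    }
}

// Usage without a network request, thanks to the optional second argument:
$pageScraper = new Scrape('http://example.com/', '<html><body><h1>Hello</h1></body></html>');
echo $pageScraper->xPathObject->query('//h1')->item(0)->nodeValue, PHP_EOL; // Hello
```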
And you could add more methods to the Scrape class, like saveImage or submitForm.
Absolutely. You build up a reusable toolkit within the class. Create a Scrape object for any URL, and you have all your scraping tools ready to work on that specific page's content. Makes the main part of your script much cleaner.
Very neat. Okay, final piece: automation. You've built this great script, maybe using a class, and it saves to a database. How do you make it run automatically, say, every night?
Yeah, you don't want to manually run it every time. This is where scheduling comes in. The source gives an example using Windows Task Scheduler.
So it's not code within the PHP script itself.
No. Usually you leverage the operating system's scheduling tools: Task Scheduler on Windows, or cron jobs on Linux or macOS. They're built for exactly this purpose.
How does it work?
Basically, you configure a task in the scheduler. You tell it the schedule, daily at 3 a.m., for instance. Then you tell it the action to perform. The action is essentially: run the PHP program and tell it to execute your specific scraping script file.
So you point it to php.exe and then your myscraper.php file?
Exactly. You provide the full paths. When the scheduled time arrives, the OS runs PHP, PHP executes your script, and hopefully your database gets updated with fresh data.
And it just runs in the background without you doing anything.
That's the beauty of it. True automation.
Wow. Okay, so we've really gone from setup, getting PHP, XAMPP, enabling cURL, all the way through fetching pages, dealing with redirects, user agents.
Extracting data with XPath for structure, or custom functions like scrapeBetween for trickier bits, handling images.
Simulating form posts, managing cookies for logins, traversing pagination politely with delays.
Saving the results into a MySQL database using PDO and prepared statements, organizing the code using OOP with classes.
And finally automating the whole thing with Task Scheduler or cron. That's quite the journey.
It really covers the fundamental life cycle of a web scraping task, based on these source excerpts. Even though the web is way more complex now, especially with JavaScript rendering content.
Right, that's a whole other challenge.
It is, but these core concepts, fetching HTTP content, parsing structured or semi-structured data, handling sessions, storing results, they still form the foundation. You might add tools like headless browsers to handle JavaScript on top, but you still need to understand these basics.
That's a great takeaway. The principles are enduring even if the tools evolve. So, thinking about these building blocks, how might you combine them for something more complex? Or how does that politeness principle, the sleep delay, become even more important, maybe even an ethical consideration, when you think about scraping at a really large scale? What's the most interesting challenge you could tackle just starting with these foundational ideas?
