The Common Crawl

Speaker 1

00:00

Brought to you by Toyota. Let's go places. Welcome to Forward Thinking. Hey there, and welcome to Forward Thinking, the podcast that looks at the future and says, Kitty McGee's in Dublin Town upon the crawl. I'm Jonathan Strickland and I'm Joe McCormick, and today we're gonna be talking about the crawl. We are talking about the crawl, not a pub crawl. No, sadly, not a pub crawl, which is what I was referring to in the lyric. But that's

00:33

not what we're talking about today. The crawl. What is that? Is that the name of a movie, like a fantasy movie from the eighties, or it sounds like a... I'm thinking Krull. Oh yeah, that's a science fiction fantasy film, a phenomenal one, I might add. Phenomenal science fiction fantasy film. Okay, so why would we be talking about a crawl that's not a pub crawl and not a sci-fi fantasy movie? And it's not the future of babies crawling, right? No, it

01:02

has something to do with the Internet. Yes, it has everything to do with the Internet, the Web in particular, actually. And here's a funny little tidbit of information that you probably already knew, but you might be a little bit fuzzy on. Wait, what's the difference between the Web and the Internet? Because when I say the Internet, most of the time, what I'm talking about is the place where people leave comments and argue about things, which would

01:26

be the Web mostly. Right. So the Internet is the network of networks of computers. Right. So you've got all these different computer networks that then connect to a larger backbone that allows all these various networks to interact and communicate with one another. That is the Internet. The World Wide Web is one thing that sits on top of this network of networks, other things being email and FTP servers and other stuff that uses the Internet as its method of

01:56

transmitting data to and from different computers. But the World Wide Web is often what we think of with the Internet because it is a very forward-facing part of the Web, or the Internet rather. Right. One way to think about the Web is that it's a gigantic collection of interactive documents. Yeah, exactly. Yeah. Some of those documents are very static and they don't change frequently or at all.

02:21

Some of them are more like programs. Yeah, yeah, some of them are more like whiteboards, where, you know, stuff is being put up and taken down and put up and taken down constantly. So some of them link to lots of other documents, some do not. Yep, and some are applications. Right. So you've got this massive number of documents. And when we say massive, uh,

02:43

it's hard to put it all into context. First of all, if you talk about all the information that we have created, not us, but humanity. Humanity itself. Yeah, the three of us have done our share, but no, humanity overall. All the information that has been created, well, back in two thousand and twelve, that was estimated to be at two point eight zettabytes. Two point eight trillion gigabytes,

03:10

trillion gigabytes. It's bigger than my hard drive, significantly. So yeah, if your hard drive can hold two point eight zettabytes, I need to see your gaming rig, sir. I think I have downloaded two point eight zettabytes of pirated anime before. I was gonna say I have two point eight zettabytes of Skyrim mods. But no, not all of this data is necessarily available for access on the web, right? This is just data that we have created. So let's narrow it down and look at the

03:43

information that's actually on the web. So the web has between ten billion and one trillion documents on it. Now that's a huge range, but it tells you that it's hard to make an estimate about something that one is so big and two is rapidly evolving. Right, there are always things being added to it and deleted from it. Yeah, you have servers that go offline from the Internet. If those servers had web pages on them, those, unless they've

04:11

been mirrored onto other servers, are no longer accessible. They have left the web. Other people are deleting their MySpace accounts. Why would you do that? Look, I have so few friends there, but so many awesome bands. Uh yeah, so, seriously, true moment: does MySpace still exist? Yes, yes, it's largely... How recently did you check? Probably, probably eight months ago. Let's look it up. Yeah, because it's a music discovery site more than anything else now.

04:44

Oh yeah, here we go, MySpace dot com. Oh, oh, it's breaking my browser. Hmm. I was about to say, why did you go to that? You realize that MySpace is like the home of the auto-play music file, right? No, we just talked about how that's like my least favorite thing. Well, let's not invoke the auto-playing music gods. Yeah. So the reason why we're even talking about how much information is on the web and how many documents there

05:10

are out there is that the web. You can think of the web as representing the world's largest database of information, and that information spans every topic imaginable. Yeah, and there's lots of great stuff out there that might be really relevant to you, might have answers to questions that you have, or it might just be very interesting to you. But a strange question that you may never have considered is how do I get the stuff that I want to

05:38

get from the web? I mean, you know how you get it in practical terms. Well, you go, you sit down at Google and you type in terms. Or Google, yeah, or you maybe have some kind of aggregator, like a friend on social media or some kind of content writer, or perhaps you have received the direct URL of a website that you wish to visit. Yeah, you might have one in particular in mind that you go to frequently, and so, like, Dinosaur Comics dot

06:09

com. Awesome, good example. Yeah, so there are all these different ways. But let's say that you want to use the web to do something beyond just visiting a particular web page. If you know the URL, that's pretty simple. But what if you're just trying to find something? Yeah, maybe you don't even know what that thing is, or you know what that thing is, but nobody has gathered that and placed it into an

06:35

easily digestible piece of information. So, in other words, let's say that you're looking at some sort of statistical uh result that you want to know. You want to know the percentage of people who drove red cars in two thousand and twelve who ended up getting speeding tickets, and you know this this sort of thing like, there may be a web page out there that has that specific

07:00

answer on it, but there may not be. However, there may be the data out there that exists across multiple web pages and multiple places that could answer that question for you, but there's no easy way for the average person to be able to collect and collate all that data, analyze it and get to a meaningful answer, especially not quickly, because if you wanted to go through the entire Internet to try to find that information, it would take you a minute. Yeah, it would take quite some time. So

07:32

what we wanted to do is tough. Sometimes well, again, depending upon what it is you're looking for, right, because in some cases you may have very little information and it may take you some time just to make sure that the information that you do have is worthy of consideration.

07:49

Or you may have the opposite problem. Let's say you want to look at anything that has to do with cats. Good grief, you're gonna have so much information on the Internet, and on the web, that relates to cats that finding the, you know, separating the signal from the noise would take you a really long time. So, uh, this problem already has a solution, and that is why we are today talking about web crawlers. Yeah.

08:18

And web crawlers are something that has been around for about as long as the web has been around, because people realized early on that in order to make the web really user friendly, especially once it grew beyond a collection of, you know, three computers, right? Yeah, three computers with twelve web pages altogether. Once you get past all of that and you get to a point where it really is growing rapidly, you need a way to navigate through the web and find the stuff you're interested in.

08:48

You need an index. Yeah, you have to have that index because otherwise the only other option you have is to know the address of a particular web page and then to just follow whatever links that web page happens to have, and then once you hit a dead end, you've got to backtrack. And, you know, it's kind of like a choose-your-own-adventure book, and it's a choose-your-own-adventure book that isn't even connected

09:09

to all the pages that you need. Right, So indexing is a way of creating a means to find web pages about any given keyword. Right, And again, this is a big, big job. You can't expect this to be something that only humans are doing under human power. It would take way too long and it would be exhausting. So there have to be automated ways to index web pages. Well yeah, I mean, just consider the ridiculousness of the alternative. So let's say you are searching for a term and

09:47

that term is, I don't know, lobster baseball. Somewhere out there, there might be a page about lobster baseball, but it would not be a good way to find it. To say, well, I'm going to ping every web server in the world and see if it's offering any public pages that say lobster baseball on them. Yeah, that would not especially you know, as the Web grows and gets larger and larger and larger,

10:14

that task becomes impossible. It would just it would take your computer longer than your lifespan to complete the job, especially considering that, as we mentioned before, the web is constantly changing, so we would have new web servers joining while you're still doing this pinging operation, which means you just have you know, you've added more that you have to ping before you're done. You never finish. So what's the solution, Well, web crawlers would be would be the solution, Joe.

10:42

Web crawlers and search engines are our favorite things here at HowStuffWorks. I mean, if it weren't for them, our jobs would be significantly more difficult. So, uh, let's say that you've got... all right, so to break it down, we've got web servers that have web pages on them, right? We have... it's a computer somewhere out there.

11:02

It's got a public-facing document that it will show you if you ask for it, right? And your browser is the way that you ask for it. Right. So your browser is your conduit to getting the information that's stored on other computers that may be on completely different networks, on another part of the world even, and the fact that you have a browser is what allows you to have access to that document that exists on that other page. But those servers can have really

11:31

funky names. Um, the web pages may not have a title that is identical to what it is you're looking for, but the information may be in that page. Sure. To use my prior example, Dinosaur Comics dot com used to be known only as qwantz dot com. Perfect. With the QW? The way that you sometimes spell words. Yes, exactly, the way words are never spelled in English. Uh yeah, my example in my notes is that let's say that

12:00

you're looking for funny cat memes. The funniest memes happen to be on a page that has the title "Things FDR Definitely Didn't Say." Well, the title of the page wouldn't tell you that there are cat memes on there. You would need something to have searched that page to understand what actually appears on that page, the context within which it appears, and to be able to serve that up to you. And that's really where the crawlers come in.
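The point about titles can be illustrated with a tiny inverted index, the word-to-page lookup that a search engine consults instead of re-reading every page. This is a minimal sketch, not how any particular search engine works: the two pages, their URLs, and the function names are all invented for illustration.

```python
import re
from collections import defaultdict

# Hypothetical pages: URL -> (title, body text). Invented for illustration.
PAGES = {
    "http://example.com/fdr": (
        "Things FDR Definitely Didn't Say",
        "A collection of funny cat memes with made-up quotes.",
    ),
    "http://example.com/dogs": (
        "Dog Training Tips",
        "How to teach a dog to sit and stay.",
    ),
}


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())


def build_index(pages):
    """Map each body word to {url: [positions of that word in the body]}."""
    index = defaultdict(dict)
    for url, (_title, body) in pages.items():
        for position, word in enumerate(tokenize(body)):
            index[word].setdefault(url, []).append(position)
    return index


def search(index, query):
    """Return the sorted URLs whose body contains every word of the query."""
    words = tokenize(query)
    if not words:
        return []
    matches = set(index.get(words[0], {}))
    for word in words[1:]:
        matches &= set(index.get(word, {}))
    return sorted(matches)


INDEX = build_index(PAGES)
print(search(INDEX, "cat memes"))
```

Searching for "cat memes" finds the FDR page even though neither word appears in its title, because the index was built from the page body. Keeping word positions is one simple way to record where on the page a word appears.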

12:25

They build out these indexes of words and where to find those words on the web. Like, uh, they use lots of... they use, well, actually pretty simple software. They are often referred to as either robots or spiders, and they're called spiders because they crawl the web. That is good. Yeah, all right. So here's where we mention that most of these terms... So wait, are we the flies? Good question. I mean, I think the uh cams, I'm

12:59

not so. So here's where we mention that a lot of these terms were all invented around the same time. And boy, when we go with a metaphor, we just go whole spider. So, all right, spiders typically start by traveling to web servers that have lots of traffic, the ones that are the most popular, and they explore the most popular web pages and start to build up the index of words of those web pages.
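A spider that starts from popular pages and works outward along their links is, in its simplest form, a breadth-first traversal. The sketch below assumes a small in-memory link graph instead of real network requests; the page names and the `crawl` function are hypothetical, and a real crawler would also need politeness delays, robots.txt handling, and error handling.

```python
from collections import deque

# Hypothetical link graph standing in for the web: page -> pages it links to.
LINKS = {
    "popular.example": ["a.example", "b.example"],
    "a.example": ["c.example", "popular.example"],
    "b.example": ["c.example"],
    "c.example": [],
    "island.example": ["popular.example"],  # no other page links here
}


def crawl(seeds, links, max_pages=1000):
    """Visit pages breadth-first from the seeds and return them in visit order."""
    visited = []
    seen = set(seeds)
    queue = deque(seeds)
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        visited.append(page)  # a real spider would fetch and index the page here
        for target in links.get(page, []):
            if target not in seen:  # never queue the same page twice
                seen.add(target)
                queue.append(target)
    return visited


print(crawl(["popular.example"], LINKS))
```

Note that `island.example` is never visited: a page that nothing links to is invisible to a link-following spider unless it is supplied as a seed.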

13:28

Then all the links that are on those popular web pages, the spiders start to follow those links and index those pages in turn, and then do the same thing over and over and over again, so they just you know, it's it is like a spider web or a crack in the glass where you see its splintering over and over while the glass shatters. The same sort of thing. It's following all those potential pathways, and they can hold hundreds of pages open at a time. We're talking like

13:57

three hundred pages a second. So yes, more than Google Chrome will allow me to have open before my computer says, listen, I give up. Uh, so, depending on the crawler, the spiders will index these pages based upon which words appear in the page and where those words

14:16

actually appear in that page, like in what context. So you may remember in the early days of the web, before web search engines got really sophisticated, that some people would make a web page and then just litter the bottom of the page with tons of random words that were doing really well in search, mostly because they had ads served on the page of the type that they

14:39

got money from per page view. So you might have, like, Britney Spears, right? Yeah, it would often be celebrity rumors and gossip, that kind of stuff, and just random recipes. Yeah, yeah, it'd be weird stuff. Like, totally, some of it would be disturbing to read. You're like, wow, I can't believe that. To know that this particular term is a very popular search term is disturbing. Others would... Yeah. Yeah, I'm more of an AC/DC kind of guy myself, so I'm with

15:09

you there. So anyway, uh, you know, this was a way of fooling search engines into indexing that page on multiple indexes so that no matter what search you put in, your page would pop up. I'm saying, assuming you are the one who is administering this web page, you have no ethics. Like, you don't care if people come to your page and are completely disappointed because it has nothing to do with the search term they put in there.

15:36

You just want to get those sweet sweet clicks. You just need the page views because you need to pay the bills, right, So, uh, search engines and spiders got more sophisticated, so they were able to look for the placement of words where it fell in the page, whether or not it appeared more than once within a page, to understand if a page really was about that particular search term, or if it was just one of those things where the word happened to appear once, it may

16:05

be a saying or a quote that has very little to do with the actual substance of the rest of the page. You know, this would help the search engine rank the page in search, right? Right. So, the final product of these spiders doing this indexing is called a crawl, and it's essentially a lightweight copy of the World Wide Web that's built to be much more easily

16:30

searched than the whole web itself. And a crawl usually consists, therefore, of this huge cache of data about the web, including, like, the text of each page its spiders encountered, the code that constructed those pages, HTML, et cetera, and a certain amount of metadata, you know, certainly the page's URL and maybe the tags. As we discussed, that's not always as useful as it used to be due to, uh, scammy stuff.
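The three ingredients listed here (page text, the HTML that produced it, and metadata such as the URL) could be held in a record like the one below. This is only a sketch: the `CrawlRecord` name and its fields are hypothetical and are not the file format Common Crawl actually publishes. The text is pulled out of the markup with Python's standard-library `html.parser`.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the non-blank text found between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Note: this naive version also keeps text inside <script> and <style>.
        if data.strip():
            self.chunks.append(data.strip())


@dataclass
class CrawlRecord:
    url: str    # where the page was found
    html: str   # the markup exactly as fetched
    text: str   # visible text extracted from the markup
    metadata: dict = field(default_factory=dict)  # e.g. tags or a fetch date


def make_record(url, html, **metadata):
    """Build a CrawlRecord, deriving the text field from the HTML."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    return CrawlRecord(url=url, html=html,
                       text=" ".join(extractor.chunks), metadata=metadata)


record = make_record(
    "http://example.com/",
    "<html><body><h1>Hello</h1><p>cat memes</p></body></html>",
    tags=["cats"],
)
print(record.text)
```

Storing the extracted text next to the raw HTML means later analyses can skip the parsing step, which matters when the same crawl is read many times.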

16:56

But yeah, so so creating a crawl is a huge project in terms of time and computer equipment and drive space and spider programming and just sheer Internet bandwidth. Right. So, for the longest time, this is something that was really only accessible by big corporations like Google or Microsoft, Yahoo that. Yeah, we're talking huge companies that have the computer power and the bandwidth to pull this sort of stuff off on

17:25

a regular basis. And while those are incredibly useful for us as consumers, if we are looking for a specific piece of information that happens to live somewhere on the web page, if we want to do more of a big data analysis, something where we need to collate the information across multiple, perhaps hundreds or thousands of web pages, it's not easy, right? We don't have those tools for the most part. Right, right. Because when you go to Google, you can't access that level of information.

17:55

Yeah you can, you can ask, uh, you know what Hugh Grant was doing last week? Right? Yeah, you can get the most popular or the highest ranking search results, which could give you at least some useful information. But again, if you want to do a wide spread study on a specific thing, unless someone's already done it, in which case you may just need to replicate their their study to make sure that it was correct. Um, you're you're kind of out of luck. So where can someone turn.

18:26

Let's say that it's a researcher who's working on something and they don't work for one of these big companies. Where can they turn to leverage the incredible asset that is the World Wide Web? One Gil Elbaz started up a nonprofit corporation called the Common Crawl Foundation, and it has been since then working on providing publicly accessible, free crawls to anyone who wants to use them. And, uh, Elbaz is a really interesting dude. A little bit

18:59

of background on him. Um, he co-founded a company back in the nineties called Applied Semantics, which created software that matched ads to web pages, like, contextually and automatically. Oh, we know a little bit about that. Yeah, yeah. And this prompted Google to acquire them in two thousand three for like a hundred and two million bucks, so not doing too bad for himself. Also, that's essentially the

19:24

reason why Google AdSense exists. That the programming that led to Google AdSense so very very practical application of that

19:32

contextual understanding, right, right. Um, in two thousand eight, interestingly, and kind of a side note, he founded a company called Factual, which seeks to gather and analyze global location data in order to create a repository of really high quality, easily accessible location data that's, uh, factual. And companies like Bing and Samsung and Yelp all use Factual to construct local maps and personalized advertising for mobile consumers.

20:05

So, uh, pretty nifty stuff. And what I am saying is that Elbaz is passionate about and experienced with big data. Right, and we've talked about it before on this podcast, that big data is, you know, it sounds like one of those just industry buzz terms, but it really is one of those things that holds a huge amount of potential to affect our lives in different ways, assuming that we've developed the right means to analyze that massive amount of information that's out there, to collect it

20:37

in the first place. Sure, and once you have a way of processing, of collecting and being able to access and process vast amounts of data, you can do a

20:46

lot of amazing things. Like, big data and the ability to process it might be the key to, say, for example, computational modeling that predicts complex social phenomena. By analyzing big data coming from social media and from news and from weather and from all kinds of sources, it can really mean that we are able to actually see elements of order in what previously appeared to be a truly chaotic system,

21:13

which is kind of exciting. Sure. And then on the other hand, a lot of people think that big data could be one of the ways that we finally achieve that next level of artificial intelligence, by having machines sort of plumb the depths of this data with self-teaching and self-learning mechanisms. Right. Well, let's get back to the Common Crawl. Okay. Right. So the Foundation began compiling crawls back in two thousand eight. The most recent one that they released as of this podcast at the end

21:45

of May was from April. It was some hundred and sixty-eight terabits in size... terabytes in size. Huge. That's big, uh, and contains some two point one billion web pages. That's not that many, really. I wrote forty-seven million web pages before breakfast. No, I did not. I'm just kidding. No, that's a lot. Yeah, yeah, yeah, it's a bunch. Um, but so they're continuously indexing and releasing new crawls, right? It's

22:26

not like it's, here's the Internet and now we're done. Yeah. Yeah, they've been releasing a new crawl every month since July. I mean, that's incredible when you think about the amount of work that is. It also means that you have, like, a timestamped photograph of what

22:44

the web was at that moment from these crawls. Yeah. Yeah, I hadn't thought about it quite that way before, but yeah, that's it's kind of you know, things that may not exist from one month to the next, you could actually see and watch those trends. Yeah, it's fascinating. Yeah. I'd say one of the main ways that I often encounter web archives is when I'm trying to find evidence of

23:08

something somebody did in the past that they wanted expunged. Right. This makes it sound like I'm some kind of detective. Not like I'm trying to find... but you know what I'm talking about. No, I know, I will post something and then they'll be like, oh wait a minute, no, that was a bad idea. I know, if you try to delete it. I know, if you use archive dot org, you can find one of the web pages I built way back when, and I never want anyone to ever

23:35

see it because it was that bad. But they will forever be able to. Yeah, you shouldn't talk about it on podcasts. Pretty sure they're not going to be able to find it. Tens of thousands of people. Let me guess, you had a bunch of Rage Against the Machine lyrics, and it auto-played MIDIs. They're so close closest, No, you're really far away. I made some web pages for a particular company I worked for, which is, unless you know the company I

24:04

worked for that I'm specifically referring to. That's why you're never going to find that particular web page, and you shouldn't. It was terrible. It was. It was about, we're going to find these Go look on his LinkedIn profile. We can figure out what company was. I already found some lobster baseball stuff, so yeah we can. That comes from you almost Got a spit take on you almost got and uh yeah, that comes from a pre episode conversation. No,

24:31

that was actually in the episode, wasn't it? I can't keep track anymore, all this Dungeons and Dragons saving throw talk we had. Okay, hold on, I've got a question. So hold on. If the Common Crawl is trying to preserve and make accessible continuously updated snapshots of the web,

24:52

where on Earth are they going to, like, store that and make it available? Right, because it's not going to fit on, like, a thumb drive. So where... I think it's also funny that that sort of becomes part of the web. Like, the web now incorporates a snapshot of the previous web, and so it just gets that much larger. So yeah, where's the stuff living? It is all living

25:13

on Amazon's web services. Uh, specifically, it's stored in Amazon Simple Storage Service, or S3 as it is sometimes known, and you can analyze it via Amazon's Elastic Compute Cloud, or EC2. And this is so cool, guys, because, okay, given Amazon's web service scope, it means that practically anyone in the whole world can download entire crawls for free, or can, if they don't want to, you know, use a hundred sixty terabytes of space.

25:47

They can just use EC2 to really easily run simple data crunches for, like, an hourly charge, like anywhere from a few bucks to maybe fifty dollars for pretty simple computations. And of course, more savvy users can write their own code to investigate stuff with. But yeah, for the common user, this is revolutionary. Yeah. To me, this is, uh, I mean, obviously I would recommend doing the approach where

26:14

you're searching on Amazon's stuff. I can't imagine the phone call you would get from your internet service provider: I see you're trying to download a hundred and sixty-eight terabytes' worth of data over our lines. You

26:28

have gone significantly over your bandwidth cap. Yeah. And they can do all of this because Amazon has specifically chosen to waive their storage fees for this and a handful of other things that they consider to be of wide public interest, like weather and census data.
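To put the download joke in numbers, here is a rough back-of-the-envelope calculation. The hundred and sixty-eight terabytes comes from the episode; the 100 megabit-per-second home connection is an assumption chosen for illustration, and the sketch uses decimal units (one terabyte as 10^12 bytes) and ignores protocol overhead.

```python
def download_days(size_terabytes, megabits_per_second):
    """Days needed to transfer the given size at a constant line speed."""
    bits = size_terabytes * 1e12 * 8               # terabytes -> bytes -> bits
    seconds = bits / (megabits_per_second * 1e6)   # bits / (bits per second)
    return seconds / 86_400                        # seconds -> days


# 168 TB is the crawl size quoted in the episode; 100 Mbit/s is assumed.
print(round(download_days(168, 100), 1))
```

At that assumed speed the transfer would run for roughly five months without a break, which is why analyzing the data where it already lives is the more practical route.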

26:46

And that's incredible, right? I mean, because you are talking about a significantly huge amount of data. So to say, you know, this is so important and so potentially beneficial to mankind that we're not going to end up, you know, charging the storage fees for it, that's great. Yeah, very, very encouraging. So when we get down to, all right, well, what can you actually use

27:12

all that data for? Well, just think about the stuff that's on the web and pretty much anything you could think of that you know, you have a question you might have that could not be answered through a simple search query, you could potentially answer by leveraging this information. So, uh, you know, we're talking about everything from lots of stuff

27:29

that deals with AI. Actually, like you were mentioning earlier, Joe, stuff like developing better natural language algorithms so that the the machines of the future can understand a wider variety of inputs and make meaningful connections between that input and the desired output. So in other words, I could talk to my computer or my phone as if it were a person, and no matter how I might word things in my own quirky way, the machine understands what I mean.

28:01

So it's not responding to what I say, it's more responding to what I mean, which would be awesome. Uh, also stuff like speech recognition, um, emerging global trends. We mentioned that. You know, let's say that you wanted to track the outbreak of a disease and to try and get at, you know, where did this start? How can we prevent this from happening again? That kind

28:25

of thing that can be really useful. And sometimes you're tracking this through uh, not like official documents, but through you know, people on Twitter saying, oh, I have the flu right, Yeah, it might be it might be social media, it could be news reports. I mean, it could be all these different, completely separate pieces that would be way too hard for you to put together on your own. Sure, we did a whole episode once about financial purposes for

28:54

big data, so playing the stock markets and all that. Yeah, yeah, exactly. So this is really important stuff, and we're gonna see a lot more examples of people leveraging big data, especially now that it's outside the realm of just mega corporations, right, where we can see people or researchers who have interest in all sorts of different fields taking advantage of this massive amount of information that we continue to accumulate day after day. And, uh, that's really exciting.

29:27

It makes me think of having access to the best research librarians in the world, all boiled into the largest library you can imagine. That's essentially what we're talking about here. So very exciting, and the common crawl is pretty inspirational.

29:46

It's one of those things where you realize it took a lot of determination to make that become a reality, and the potential benefit... and it also is incredibly forward thinking to be all the way back in two thousand eight and putting this together. And that was before we had really developed sophisticated tools that could leverage it properly. Now we're getting to see that as big data has become an industry unto itself. I mean, it's really exciting now.

30:13

So thank goodness that the idea was implemented long ago enough for it to actually have, uh, you know, an established presence, and now we can really see how we can take advantage of it. So Elbaz is such a fascinating dude. I have found so many interesting interviews and stuff with him. I think that we should maybe, maybe not on the show, but maybe if you'd want to do a TechStuff episode kind of focused on him, that would be kind of cool. Yeah.

30:43

I love to do episodes where we are able to look at influential figures in technology on that show, So that would be fantastic. I will add that to the list. Uh So, the Common Crawl is really an interesting project. If you have not looked into it, go check it out, um, you know, because it may be one of those things that could come in handy if you're working on a research project. If you are just curious about how big data is gonna continue to have a huge impact in

31:14

our lives, go seek that out. Yeah, you can find them at commoncrawl dot org. And if you're into making donations to nonprofit organizations that are tax deductible, you can do that thing too. That's pretty cool. All right, guys, that wraps up this discussion. If you have any suggestions for future topics for Forward Thinking, you should let us know. Send us an email. That address is FW Thinking at HowStuffWorks dot com, or drop us a line on Facebook, Twitter or Google Plus.

31:46

At Twitter and Google Plus, we are FW Thinking. Just search FW Thinking in Facebook and we will pop right up. Leave us a message. We read all of them. We look forward to hearing from you, and you'll hear from us again really soon. For more on this topic and the future of technology, visit Forward Thinking dot com, brought to you by Toyota. Let's go places.

Transcript source: Provided by creator in RSS feed
