Web Spider Is Our Hero

Speaker 1

00:04

Welcome to TechStuff, a production from iHeartRadio. Hey there, and welcome to TechStuff. I'm your host, Jonathan Strickland. I'm an executive producer with iHeartRadio and I love all things tech. Listener David reached out to me on Twitter and said, I would like to hear an episode on search engine spiders. He is our hero. You got it, David, and if you get the spider, he

00:33

is our hero reference, let me know. So we're gonna talk about the development of search engines and how they work from admittedly a pretty high level, because to go into great detail would probably take three or four episodes. Plus, different search engines use slightly different strategies in order to index and rank search results. And the reason I'm doing all of that is because if we just talked about spiders, it would be a fairly short episode. But what the

01:03

heck is a search engine spider? Well, to index the contents of the World Wide Web, you need to search around and find what's there first, right? You can't return results without first knowing what is out there in the first place. So a search engine spider is a bot that does this. It crawls through the World Wide Web, thus the whole spider name. We'll learn more about what's actually going on a little bit later, but to understand search,

01:34

we need a few more basics. So keep in mind that all the stuff we see online, whether it's a web page or it's a web service or whatever it may be, ultimately sits on a computer that is connected to the Internet infrastructure, so it's connected to routers, which are then connected to various servers and domain name

01:56

servers and all that kind of stuff. If you visit the HowStuffWorks homepage (that's the site for the company that I used to work for, I don't work for them anymore), that website consists of pages that are on a computer in a data center. If you happen to know the URL for the site, so you happen to know howstuffworks.com, you can type that into a browser's URL bar or address bar, and the browser will then take care of

02:23

sending the appropriate message to that computer. In this case, we will call it a server, and the server will then return the appropriate information to your browser, the web page, maybe the home page for HowStuffWorks in this case, and then you'll see the website. But all of that requires that, first, you have to know that there's a site there at all. Plus you have to know the URL for it, and you might

02:47

not have that information. Before there even was a World Wide Web, there was a need to know where you could find stuff on the Internet. Now, remember, the Internet is older than the Web, and the Internet and the Web are not the same thing. The Web exists on top of the Internet. The Internet consists of a lot of other stuff besides the Web, right, like email and FTP servers. In fact, we need to really talk about FTP servers. FTP stands for File Transfer Protocol. So these are computers that house

03:21

certain files on them, and through FTP, this protocol that allows for files to transfer from one computer to another across a network connection, people can access files and transfer them from the server to their own computer, which in this case we would call a client. But again, FTP is really only useful if you know the address of the servers where the stuff is that you want. Right, you can't just use FTP to pull a file out of nowhere. You have to contact the proper server and

03:55

pull the relevant file from that server. Enter Alan Emtage, who in nineteen nine was a graduate student at McGill University in Montreal, Canada. He also worked as a systems administrator for the School of Computer Science at the university, and he was running into a challenge. It was his job to locate software for professors, for staff, for students at the university, but there was no easy way to know where all the various files were on the network

04:26

of public FTP servers. Emtage decided there needed to be a way to get a snapshot of which public FTP servers had which files. There needed to be some sort of directory, and since servers were popping up more frequently as more people began to develop stuff for the Internet, there also needed to be a good way to search those lists to find something specific. Otherwise it would be like reading through an entire phone book to find out which person or business corresponded to a phone number you

05:00

happen to have seen. Let's say the phone number was 867-5309, and you don't know that that's Jenny's phone number. You just know it's the number. So instead of calling the number and asking, hey, whose number is this, you get a phone book and you start searching for 867-5309 to find the corresponding name. That is not efficient. In fact, in the early days, information about servers frequently had no other real channel to get to users other than word of mouth.

05:29

So there was a really good chance that there was stuff that was relevant to you that you just had no way of knowing about because you had to hear it from somebody else first. Emtage, along with a couple of other folks like Bill Heelan and J. Peter Deutsch, began building a tool to solve this problem. They ended up calling this tool Archie, which actually was not a nod to the comic book character from Archie Comics. Instead, Archie was a somewhat shortened form of the word archives.

06:01

They created programs that could look through the repositories of public FTP sites and get an inventory of the files stored on those servers. Or, as documented in the book A Rough Guide to the Internet by Nicholas West, quote, it combined a script-based data gatherer, which accessed listings from anonymous sites, with a script which matched regular expressions, which could retrieve file names matching a user query, end quote. Simple, right? Now, in case you're like me and

06:34

what I just quoted sounded a little bit confusing, what it really boils down to is that they made a computer program that followed some fairly simple rules. The program made note of the file titles that were on various FTP servers, kind of like a list of contents, and it noted which files were on which servers. Another part of the program arranged those findings into a database not that much different from the types of spreadsheets you've

07:02

probably worked with in the past. Emtage and crew also created a tool that would allow them to search this database. Before long, other people began to hear that he had this database, and they would ask him, hey, can you do a search for me, and they would give him the search terms, and it started taking up a lot of his time. So in an effort to streamline things, he programmed a user interface or UI that would allow

07:26

people to conduct their own searches. They could just log into this tool and then type in the file that they were looking for and it would return the results for them. So as long as they were sure about the specific file they needed, they would get the results. Now, most sources generally agree that Archie was the first real search engine on the Internet, but it wasn't a web search engine. The Web didn't exist yet. It wasn't long

07:53

before a couple of other tools followed. In some researchers with the University of Minnesota developed a new tool to organize and discover documents stored on servers, and the tool was called the Gopher protocol. Servers were data repositories called Gopher holes (eventually that's what they were called, anyway), and Gopher organized everything into a hierarchical text-based menu system. So this was a specific strategy that was built on top of the Internet. You can kind of think of

08:25

it as being in parallel with the Web. It predated the Web, but the Web and Gopher would exist at the same time, but they were not the same thing. This was a different strategy in order to serve information across the networks. Before long, like in more researchers developed a search function to work on top of Gopher. Because again, if you didn't know where something actually quote unquote lived in the Gopher network, you would never be able to

08:56

find it unless you were just lucky. So this search tool was called Veronica, and that is pretty cute because Veronica is a character in the Archie comic books. And while the search engine Archie did not pull its name from Archie Comics, Veronica was a nod to the older search engine as well as a nod to the comic book, so it almost kind of retroactively made Archie relate back to the comics. Later, computer geeks assigned a backronym to Veronica.

09:29

This is an acronym that you create after you've already named a thing. So you've given the thing a name, and then you're thinking, okay, well, what can we say each of those letters stands for? That's a backronym. And in this case, the revisionist name was Very Easy Rodent-Oriented Net-wide Index to Computer Archives, or VERONICA. Super cute.

09:51

What Veronica did was fairly primitive. It created a database of every file and every directory on every Gopher server that was connected to the Internet, and it

10:02

would update dynamically as more servers joined the network. That approach worked fairly well when there was still a relatively small number of servers to keep track of. But as more servers came online and joined this Gopher network, with more documents stored on each server, it started to get really, you know, challenging to manage Veronica. A secondary Gopher search tool kind of addressed this problem, and this one also took its name from a character from Archie comics,

10:34

Jughead. This one didn't create a full database of everything that was on the Gopher network. Instead, as a user, you would have to designate which Gopher server you wanted to search, so you had to at least have some general idea of where it was you needed to look. But if you did know that, it was a much faster approach than trying to search everything on the network as a whole. Gopher had a major problem, and that was that it was becoming increasingly less efficient and less easy to navigate.

11:06

The larger it got, it didn't scale well. Meanwhile, at the same time that Gopher was growing, a guy named Tim Berners-Lee over at CERN (you know, that's the research facility that oversees stuff like the Large Hadron Collider) was developing a different approach to storing and sharing information across networks. Tim and his team at CERN developed a protocol called Hypertext Transfer Protocol, or

11:34

HTTP, and Hypertext Markup Language, or HTML. Both of these kind of grew out of stuff that CERN had been using internally for a while. Now I'm guessing those terms sound familiar to you guys. These are the two components that really formed the basis of web pages and the World Wide Web. The markup language acts as the set of instructions on how a computer, or more specifically, how a browser is to interpret and display documents, eventually

12:04

including stuff like images and sound files, although initially the web was strictly text based and browsers were text based as well. HTTP is the set of rules and the processes through which a client, that being your web browser, can request a specific document from a server, and how the server can then send that requested document to the browser. The server sends the HTML files to the client, and the client interprets those HTML files in order to display the relevant web page

12:37

to the user. Hypertext refers to text that has a link to some other text, and you can think of it kind of like a footnote in a book. The hypertext has an asterisk that corresponds to another piece of information somewhere that is also marked by an asterisk, except in this case the asterisks are invisible. It's highlighted text, or text in a different color, or it's underlined. It's designated in some way to be different from all the rest of the text. That's what lets you know

13:10

it's hypertext and it's linked to something else. Hypertext documents connect to one another through hyperlinks. Those documents don't even have to be on the same server. They can be on opposite sides of the world. So this means you can build a reference in one hypertext document to content that's found on a totally different hypertext document. Clicking on

13:31

that hypertext activates the link. It sends a command to the browser, which then tells the server that the client wants to see specific linked information, and the server returns that. You can also link to locations that are within the same page of a document, or specific locations on other pages. It doesn't have to just be click on this and you go to a new web page. It might be click on this and you skip down, you know, a significant number of paragraphs to

14:02

get to the relevant information. Really, all the link is doing is telling the browser where some specific point in a specific document happens to be and how to get there.
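To make that a little more concrete, here is a minimal sketch in Python of the information a link carries. This is only an illustration, not how any particular browser does it: the function name describe_link and the example.com addresses are made up, and the only real pieces are urljoin and urlsplit from Python's standard library.

```python
from urllib.parse import urljoin, urlsplit


def describe_link(current_page: str, href: str) -> dict:
    """Break a link into the pieces a browser needs to follow it.

    current_page is the address of the document containing the link,
    and href is the (possibly relative) target written in that document.
    """
    # Resolve the link against the page it appears on.
    absolute = urljoin(current_page, href)
    parts = urlsplit(absolute)
    return {
        "absolute": absolute,
        "server": parts.netloc,      # which computer to ask
        "document": parts.path,      # which document on that computer
        "fragment": parts.fragment,  # which point inside the document
    }


# A link to a specific spot in a different document on the same server.
other_page = describe_link("http://example.com/recipes/index.html",
                           "eggs.html#cracking")

# A link that just skips down within the current page.
same_page = describe_link("http://example.com/recipes/index.html",
                          "#step-3")
```

In the first case the browser has to request a new document and then scroll to the marked spot. In the second case the document part is unchanged, so the link only moves you around inside the page you already have.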

14:13

It's kind of like if you were reading a book that said, want to know more? Skip to page nineteen and read the third paragraph. Or sometimes I compare it to those old Choose Your Own Adventure books where you get to the bottom of a page and you have to make a decision, and based on which decision you make, you have to turn to a specific page to pick up the story again. Well, you can quickly see how this would be really useful. Let's say I want to make a web page that includes

14:37

directions on how to perform a particular process. We're gonna call it baking a soufflé, and the steps I list include references to other, maybe slightly less involved processes that are part of this, and I don't go into explaining how those work. Let's say, like, I talk about cracking eggs, but I don't tell you the best way

14:56

to crack an egg. However, I could create hypertext links to other sets of instructions, maybe a specific page just on different ways to crack an egg, And that way you could go and look that up if you weren't confident. So if you don't know how something works, you can click on that other link and go to a page

15:13

that's dedicated to that in order to learn more. And yes, I just described how the web works in general, which is something I'm sure you all know at least at some level, even if it's not, you know, a formal one. But if you've ever been on, say, Wikipedia, reading up on a topic, and saw a hypertext link and thought, yeah, I should find out what this term means, I don't understand it, so you click on that and you go follow that so you can get a better understanding, then that's

15:40

the use case I'm referring to. And there's a conversation we could have about what actually goes on when you click a link, but that would require a deeper dive into how the web and by extension, how the Internet works on a very technical level, and I think that goes beyond the scope of what we're trying to do

15:57

in this episode. So I'm going to simplify, perhaps to a ludicrous degree, and say that a link contains within it the information about where another document, or even a specific point within another document, exists, and activating that link by clicking on it in a browser initiates a sequence that results in the browser requesting that specific document from the appropriate server, which then sends that document to

16:24

the browser so that you can see it. A lot more is going on to make this happen, but let's just stick with that high level view. So the pair of HTTP and HTML evolved at the same time that Gopher was establishing itself, and some people stuck with Gopher, but it just really never took off the same way that the web did with HTML and HTTP. The protocol and the markup language are what the web is built upon, and we call it a web because

16:54

of that interconnectivity of documents. You can build out a doc and then link that document to another doc which might be linked to a dozen others, and by following those links, you can navigate from one document to the next. It really is similar to what happens to a lot of people when they visit Wikipedia and they just start following all sorts of links. But you might already see

17:15

a challenge with that kind of design. It works great if you've got a centralized person or institution that's building out the web, adding pages in a very logical way and linking to them in a very logical way. But one of Tim Berners-Lee's major goals was to create a democratized system that didn't depend upon a centralized authority. People should be able to build their own web pages and host them on their own servers. But how would anyone else find them if there were no links going

17:46

into those pages? If the web pages are made and hosted independently of the first few pages on the web, where is the connective tissue? The original solution wasn't a search engine. It took a bit more of a hands-on approach. I'll explain more, but first let's take a quick break. In the early days, when people first started building documents to host on the web, in other words, the earliest web pages, Tim Berners-Lee would take it upon himself to create an index hosted on CERN's own server.

18:28

Someone might send him a message saying that they had built and are hosting a new web page, and they could include the address or URL. Berners-Lee would then add a hypertext link to a growing catalog of those kinds of links on a page hosted by CERN. So if you visited CERN's site, you could navigate to that index and see the links to the other sites. Tim came to call this a virtual library. He and

18:55

a group of volunteers oversaw its evolution. They organized it into different areas of interest, with subject matter experts overseeing specific categories, and a lot of these early pages belonged to scientific research organizations or universities or publications, and all that makes sense. CERN is the organization that oversees the Large Hadron Collider, after all. So it's no surprise that

19:17

the early web really focused on science and academia. Also, it's good to mention that the web in those early days again was text based. Browsers were text based too, that would not really change until another year like nine. Like Gopher's design, the virtual library approach worked fairly well when the Web was still small in scale. According to the Virtual Library website, in August nine there were about

19:46

twenty web servers in existence total. A little more than a year later, in October of nineteen, it was more than two hundred web servers, so growth was still fairly modest, but things kind of took off after that. By January nine six, there were more than one hundred thousand web servers. The following year there were more than six hundred fifty thousand. It was growing so fast, and maintaining an index was becoming increasingly more difficult, particularly by doing it, you know, manually.

20:23

The virtual library was taking shape around the same time that students were building the Veronica search engine for Gopher, so all this was happening around the same time. I know it sounds like I'm going strictly chronologically, but that's just not too helpful. We have to remember this is all happening simultaneously. So as the web grew and

20:43

became more complex, indices were growing as well. Just navigating an index to find what you wanted would become a challenge, particularly if you weren't thinking in the same way as the people who had organized the index. This is where taxonomy comes in. Taxonomy refers to a system of classification. A taxonomy is a set of rules we use to organize stuff, and there is no one way to do it correctly. So I'll give a simple example. Let's say you've got a class of students and you have them

21:16

all divide up into smaller groups. You give each group a pile of documents, the same documents per group, but you tell the students it's their job to organize those documents. Well, one group decides that they're going to organize all the documents by alphabetizing them by title. The title of each document will determine how they fall in the pile, so theirs are all in alphabetical order. Another group decides that they're gonna bundle their documents that all cover the same

21:45

subject matter together, and they'll alphabetize within subjects. So they might have a stack that's just about biology, another stack that's about chemistry, another one about material science, and so on. A third group bundles their documents together by author. They put all of the same author's works together, and then maybe they alphabetize the authors. Yet another group focuses on publication date. They arrange all their documents in order

22:12

of when they were published. So you can imagine combinations of these approaches as well, right, such as ordering documents chronologically, but then if two documents were published on the same date, you then alphabetize them. That kind of thing. You can think of those different sets of rules, and you have to determine which rules are most important, right, which one you do first, and which is secondary. Well, it's important that these taxonomies, however you construct them, are consistent or

22:40

else it becomes a chaotic mess. But even a well organized and maintained taxonomy can still be a challenge for someone new coming into the system. And that's really what I'm getting at here. A comprehensive index will seem incredibly overwhelming to someone who's unfamiliar with the system's taxonomy, and it will still seem like finding a specific document or web page is an impossible task once that index grows to a large enough size. Clearly, a search tool would

23:11

be a big solution to that problem. If you can type a query into a search engine and you can get the results you want, you don't have to comb through an enormous index hoping that you're thinking in the same way as the custodians of that index. But in addition to those challenges was one of scale. Building this index required a lot of volunteer effort because again it was done by hand. The next step toward the development of a search engine was a project that was tackled

23:41

by an MIT student, Matthew Gray. He designed a program he called the World Wide Web Wanderer, and the purpose of this program was to automatically navigate across the web, cataloging the web's growth by registering new websites, web pages, and web servers. The World Wide Web Wanderer is arguably the earliest automated spider or web

24:07

crawler designed for the Web. When Gray developed the program back in the web browser Mosaic, which was the first popular browser designed for Windows, was just a couple of months old. Mosaic was also a graphical browser, and while it wasn't the first graphical browser, it was the first popular one available to the average person outside of places

24:30

like CERN. Gray wanted to automatically detect new websites for discovery purposes, but before long the number of sites was growing so quickly that he shifted his attention to charting the growth of the web in general. Gray wrote his program using the Perl computer language. That's P-E-R-L.

24:50

And here's a quick refresher on computer languages. We know that typically computers process information in machine code, which mostly for our cases means binary data, and that means all the information going through the machines ultimately breaks down into zeros and ones, and you can think of that like a light switch flipped either off or on. Now, a single off or on is easy, but if we want to represent more complex ideas, processes, that kind of thing, we

25:22

need a lot of bits. A single alphanumeric character in ASCII code requires seven bits, so you need a string of seven zeros or ones just to represent a letter, number, or symbol in ASCII. So you can imagine that programming in machine code would be really tough for most humans because it would be so easy to mistype a zero or a one, or to skip one. And if you're, you know, typing out a really long sequence, it's easy for you to overlook a zero or a one,

25:56

And that's why people developed programming languages. A programming language creates a layer of abstraction between the programmer and the machine or system of machines. It acts as a sort of interpreter. It takes the intentions of the programmer and turns them into processes a computer can respond to. Some programming languages are closer to machine code. Those are low

26:21

level programming languages. They're typically really challenging to work with, and others are more abstract and thus easier for us to work with, and these are called high level programming languages.
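Just to put numbers on the seven-bits-per-character point, here is a tiny sketch in Python, which is itself one of those high level languages. The function name to_ascii_bits is made up for illustration. It leans on Python's built-in ASCII encoder and the format function.

```python
def to_ascii_bits(text: str) -> list[str]:
    """Return the seven-bit ASCII pattern for each character in text.

    Raises UnicodeEncodeError if text contains a non-ASCII character.
    """
    # encode("ascii") yields one number from 0 to 127 per character;
    # "07b" formats that number as binary, zero-padded to seven digits.
    return [format(code, "07b") for code in text.encode("ascii")]


bits_for_a = to_ascii_bits("A")      # capital A is code 65
bits_for_perl = to_ascii_bits("Perl")
```

Four letters already come out to twenty-eight zeros and ones, which gives a feel for why nobody wants to type a whole program that way.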

26:32

Perl falls into that high level category. Now, originally, Gray's program would seek out links on web pages and then note the web servers that were hosting those pages before following the link over, and then it would repeat that process, and the program was really just automating the process that we would do manually if we were to look at a web page, see a link, and click on it.
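Here is a rough sketch of that loop. It is in Python rather than Perl, and it is not Gray's actual code. Everything in it is an assumption made for illustration: the TOY_WEB dictionary stands in for the real network, the example addresses are invented, and the seen set, which keeps the crawler from visiting a page twice, is my own addition rather than something described for the Wanderer.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# A pretend web: address -> HTML. A real crawler would fetch over HTTP.
TOY_WEB = {
    "http://a.example/": '<a href="/about.html">About</a> '
                         '<a href="http://b.example/">B</a>',
    "http://a.example/about.html": '<a href="/">Home</a>',
    "http://b.example/": '<a href="http://a.example/">A</a> '
                         '<a href="http://c.example/missing.html">C</a>',
}


def crawl(start_url, fetch):
    """Follow links breadth-first; return (servers found, pages visited)."""
    seen = {start_url}          # addresses already queued (hypothetical guard)
    queue = deque([start_url])
    servers, pages = set(), []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:        # dead link: nothing to record or follow
            continue
        pages.append(url)
        servers.add(urlsplit(url).netloc)
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            target = urljoin(url, href)   # turn relative links absolute
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return servers, pages


servers, pages = crawl("http://a.example/", TOY_WEB.get)
```

Starting from one page, the sketch ends up with a list of servers and a list of page addresses, which is roughly the kind of bookkeeping being described, just at toy scale.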

26:56

The program sought the links embedded in pages and then activated those links to explore whichever documents were pulled up as a result, and then would repeat that process while building out this index, kind of leaving a trail of where it had been. The program built what Gray called the Wandex, W-A-N-D-E-X. It was an index of web servers that were joining

27:19

the Internet. Not long after launching the Wanderer, Gray built in an additional capability of capturing the URLs that it was going through in addition to just the web servers. So you can think, like, originally he was just like, I wonder how many web servers are connected to the Internet, how many are connected to it today, how many will be connected tomorrow? That kind of thing, and just sort of keeping track of how the web

27:41

was growing. Then he thought, I want to know actually the URLs that exist too, so he's keeping track of both. This didn't go totally smoothly, however. The Wanderer was an energetic little spider. It would move through

27:54

links throughout the day. It would index the same pages hundreds of times in the process, and this started to cause network lag across the Internet, and it meant that people who were just trying to navigate to those pages were experiencing long delays as a result, and that made people, let's say, a little miffed at Mr. Gray. He was able to modify the spider's operation so it wouldn't cause so much of a disruption, but that early enthusiastic

28:23

mistake kind of created a tough environment for other people who wanted to create similar tools that would allow for fully fledged searching on the Internet, and the concept of spiders had kind of a big old X next to it in the minds of many people. It became synonymous with this idea of lag and just bad network performance. The Wanderer didn't index sites for content either. It was more about tracking the growth of the Internet as a whole.

28:49

It wasn't so much concerned with what was on pages, so it didn't create a means to search for specific web pages or subject matter. Meanwhile, over at the University of Geneva, a developer named Oscar Nierstrasz developed a tool that could search lists of websites and return results based on a query. This tool didn't actually survey the

29:11

web as a whole. Instead, it would access lists other organizations had made, such as the Virtual Library, so it would reformat those lists as entries into a database and that's what could be searched. It was called the W3Catalog. It still wasn't quite a search engine as we think of them today. Martijn Koster developed another tool in late which he called the Archie-Like Indexing for the Web, or ALIWEB. As the name suggests, he was taking inspiration from the FTP search tool Archie.

29:48

ALIWEB needed web administrators to provide the location for their site index files so that they could be included in the ALIWEB search database. Users could also create descriptions for the websites and add in keywords to help with search. And this brings us into the world

30:06

of metadata. Metadata is information about information. So you've got the core information of a web page, what we might consider, you know, the content of the web page, the stuff that you and I would actually read if we went there. But then you've got the metadata and that describes the

30:23

information in some meaningful way. Now, if you were in a physical library, the metadata would be the stuff that you used to help locate where in that library a specific book should be, and that could include stuff such as the author name, the publication date, the subject matter,

30:41

that kind of thing. And ALIWEB was a pretty good idea, but not many people knew about it, and of those who did know about it, not a lot of them went to the trouble of submitting the information to ALIWEB, so it didn't really see much widespread use. Also in and then extending into was the development of another search tool, and this was the brainchild of a guy named Jonathan Fletcher. He was a grad student at the University of Stirling in Scotland. Fletcher's approach combined

31:12

the strategies of his predecessors. He built a web crawler to find and index web pages. He designed the database to be searchable, and he called it JumpStation. Unfortunately, his efforts were limited by the budget that he got

31:28

from his university. He didn't have the resources to really build out a tool that could index all of a website's contents, so instead he designed JumpStation to parse web page titles and headers, and that would still help people find pages that, at least according to the title and header, were focused on whatever the area of interest was.
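As a hedged illustration of that trade-off, and definitely not a reconstruction of JumpStation itself, here is a small Python sketch that keeps only the words found in a page's title and heading tags. The class name, the function name, and the sample page are all invented. The parsing uses HTMLParser from Python's standard library.

```python
import re
from html.parser import HTMLParser


class TitleAndHeadingText(HTMLParser):
    """Keep only the text that sits inside <title> or <h1>..<h6> tags."""

    TRACKED = {"title", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self._depth = 0      # how many tracked tags we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.TRACKED:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in self.TRACKED and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.chunks.append(data)


def title_and_heading_words(html: str) -> set[str]:
    """Return the lowercase words a title-and-header-only index would see."""
    parser = TitleAndHeadingText()
    parser.feed(html)
    return set(re.findall(r"[a-z0-9]+", " ".join(parser.chunks).lower()))


SAMPLE_PAGE = (
    "<html><head><title>Egg Basics</title></head>"
    "<body><h1>Cracking eggs</h1>"
    "<p>Handy when you are making a souffle.</p></body></html>"
)
indexed_words = title_and_heading_words(SAMPLE_PAGE)
```

A search for "eggs" would find this sample page, but a search for "souffle" would not, because that word only appears in the body text that this kind of index never looks at.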

31:49

But it would also mean that other pages that might have critically relevant information about the subject could be overlooked because those terms just weren't in the title or header. We're getting closer to the search engines that would most resemble what we think of today, which, let's be honest, is primarily Google, and we will learn more about those

32:10

after we take another quick break. I think you could argue pretty convincingly that JumpStation was the first true web search engine as we have come to understand them, though it was limited since it couldn't crawl through and index all the contents of a page. The next name on our journey is one a lot of people will recognize, and that is Yahoo. But Yahoo didn't start off as a search engine. Rather, Yahoo was originally a web directory.

32:46

It started in as just a list of websites that the founders of the site, that would be Jerry Yang and David Filo, thought were interesting. As in, this is a cool website, I want more people to know about it, so I'm gonna include it on my web page about cool websites. So Yahoo started off as another curated list

33:06

of web pages. The search tool aspect of Yahoo would follow in The search tool worked on the sites that were curated in the human curated Yahoo directory, but if a site wasn't in that directory, it wouldn't show up in search results. So someone would have had to have found the website already and then included it within Yahoo's growing directory for it to register as a result. Following Yahoo were a couple of other notable names. There was

33:36

Infoseek and WebCrawler. WebCrawler was the first search engine I remember using. In fact, I stuck with WebCrawler for a long time, even after the infamous Google emerged and started making waves. WebCrawler did something that other search engines had not yet done. Its index was looking at the full content of a web page, including the metadata on that page. So let's talk about

34:03

that for a second. Web spiders, when you get down to it, are just bots that follow links, but some web spiders can also make a full index of the content found at each link's destination, essentially scanning all the text that's within a web page and indexing it so that that content is searchable. That searchable index forms as a result of all this, and it can bring back any results of any pages that contain a specific word.
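One common way to build that kind of word-to-pages lookup is what's usually called an inverted index. Here is a minimal sketch in Python, assuming the spider has already collected each page's text. The page addresses and text are invented, and this is an illustration of the general idea rather than how WebCrawler or any other real engine was implemented.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Split text into lowercase words, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map every word to the set of page addresses that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the pages that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())   # keep pages matching all words
    return results


# Hypothetical crawled text, keyed by address.
PAGES = {
    "http://a.example/spiders": "Web spiders crawl links",
    "http://b.example/garden": "Garden spiders spin webs",
    "http://c.example/golf": "Golf links near the coast",
}
index = build_index(PAGES)
one_word = search(index, "spiders")
two_words = search(index, "spiders links")
```

Notice that the one-word search happily returns the gardening page too. Matching words is the easy part, and nothing in this sketch tries to decide which result is the most relevant.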

34:32

Let's use an example. It makes it easier. So let's say you're in a literature class and you're having a real hard time understanding Milton's Paradise Lost. So you're looking for some resources to help you get a better handle on things. You go to a search engine on the Internet. It doesn't really matter which one, and you type in Paradise Lost Milton analysis. You're trying to really cut down on anything that might just mention paradise or lost or

35:00

anything like that. You really want to focus on this. This part of the search engine is the UI, or the user interface, right? This is the part that we as humans interact with in order to tell the engine what it is we're looking for. The search engine then goes and consults its index of the web. So no matter which search engine you're using, it's not a representation of every single web page that exists. It's every web

35:26

page that exists within that engine's index. So each search engine has its own index, or in some cases search engines are powered by other engines. It may be sharing an index with another search engine, but it looks for documents in that index that contain the words that you have submitted in the UI, then it has to return those results to you, which also means that the search engine has to determine which of those search results are

35:55

likely to be the most relevant to your query. This is actually harder to do than it sounds. If a search engine is only looking for documents that happen to contain the words that you've submitted, you could get back pages that have little to no relevance to what you actually wanted. Plus, some web page administrators, especially back in the early days, were really trying to game the system. They might use tricks in order to get more people

36:21

to come to that web page. And it might be because their web pages had banner ads on them and so more people visiting the page meant more money, or maybe they just wanted bragging rights. Because some of you guys might remember this. It used to be back in the day that one of the standard features you would see on web pages was the ever present web counter that would tell you how many people had visited that

36:44

website since it had been created. And a few folks were hoping to just spread malware by tricking people into visiting a website and downloading some malicious program. And then there were also link farms. These were sites that were just one long list of links to other sites. More on why that's important in just a second. One trick was to include just a ton of different popular search terms on a page, even if the page had nothing to do with any of those search terms, and you

37:17

could even hide that. You can make the text and background the same color, so a human visiting the website and looking at it through a standard browser wouldn't see anything, because the background and the text are the same color. They'd see whatever the content of the web page was, but they wouldn't see all these hidden keywords. But a computer would totally see it. It would ignore the fact that the font and the background color are the same and it would just pick up on the text.
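Here is a small sketch of why that trick worked on a naive indexer. It uses Python's standard `html.parser` module to pull out every piece of text while paying no attention to styling. The `PAGE` markup, the `TextOnlyParser` class, and the `indexed_words` helper are all hypothetical, made up for this illustration.

```python
from html.parser import HTMLParser

class TextOnlyParser(HTMLParser):
    """Collects every text node and ignores all styling, like a naive indexer."""

    def __init__(self):
        super().__init__()
        self.words = []

    def handle_data(self, data):
        # Called for each run of text between tags, visible to a person or not.
        self.words.extend(data.lower().split())

# Hypothetical page: white text on a white background hides the keyword dump
# from a person using a browser, but the words are still in the markup.
PAGE = """
<body style="background-color: white">
  <p>Welcome to my crab trap page.</p>
  <p style="color: white">free music movies lottery celebrity</p>
</body>
"""

def indexed_words(html):
    """Return the list of words a styling-blind indexer would record."""
    parser = TextOnlyParser()
    parser.feed(html)
    parser.close()
    return parser.words
```

Running `indexed_words(PAGE)` returns the hidden keywords right alongside the visible sentence, so a search for any of those keywords could match a page that is really about crab traps.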

37:44

So you would end up having these false returns on search results because those keywords were there in the page, they just weren't relevant to whatever the content was. Other administrators would put keyword dumps into web page metadata, so it wouldn't show up on the page itself at all. It would all be in the background. Following a search result like that would be really frustrating because you wouldn't actually get whatever it was you were looking for, you

38:10

would get something else. It was a bait and switch. So building search engines meant not only did the developers need to figure out how to build indices that could grow as the Web was growing, they also had to figure out how to defeat strategies that were intended to game the system. How can you make sure the people who are using your search engine are actually getting the stuff that they want? Because if they're not getting the stuff they want, they're gonna bounce. They're never going

38:37

to use your search engine again. I'm gonna use Google as the example for this, because, I mean, let's be honest, Google is dominant in that space. It's almost like it's the only game in town. But just know that all search engines, in general, were all trying variations on this

38:54

kind of general philosophy. Google's approach used a tool that they called PageRank, which, as the name suggests, would take the documents that came back from any given search, then rank those search results before presenting them to the user. So if you went to Google and you typed in Paradise Lost Milton analysis, Google would consult its own index of the web, and it would look for stuff like,

39:22

are the search terms showing up in the page? Are there examples of the words being close together? Because that might indicate that this result is more relevant. Right, if these words are all kind of next to each other, it's more likely to be what the person was looking for, as opposed to, yeah, all four of those words are showing up on this page, but they're so far apart that maybe this isn't even related to what the person

39:47

was looking for. That was part of PageRank. The tool also would look at things like the title of the page and maybe even the header, but it mostly ignored the metadata because, you know, search engine designers were picking up on the tricks people were using in order to get more clicks. At the same time, the search algorithm would assign ranks to pages based on a few other points of criteria. The algorithm attempted to figure out how reputable every page was, and it did so

40:17

in a couple of different ways. One was to look at which other sites were linking to that page. If the other sites that were linking to it were considered generally reputable, that would improve the result's PageRank score. So in our case, and this is a totally unrealistic example, let's say we've searched for that Paradise Lost Milton analysis and all we got back are three results, but Google has to rank those results as one, two, and three.
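The link-based part of that idea can be sketched in a few lines. What follows is a simplified version of the iterative link-scoring approach that has been published under the PageRank name, and it is only an illustration: the `link_scores` function, the damping value of 0.85, the iteration count, and the four page names are assumptions chosen for this example, and the real system weighed many more signals.

```python
def link_scores(links, damping=0.85, iterations=50):
    """Score pages by who links to them.

    links maps each page name to the list of pages it links to.
    A link from a high-scoring page is worth more than one from a low-scoring page.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    scores = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_scores = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current score evenly among the pages it links to.
                share = damping * scores[page] / len(targets)
                for target in targets:
                    new_scores[target] += share
            else:
                # A page with no outgoing links spreads its score over everyone.
                share = damping * scores[page] / n
                for other in pages:
                    new_scores[other] += share
        scores = new_scores
    return scores

# Hypothetical link graph: three pages point at "reference", nobody points at "obscure".
LINKS = {
    "news": ["reference"],
    "blog": ["reference", "news"],
    "reference": ["news"],
    "obscure": ["reference"],
}
```

With this toy graph, `link_scores(LINKS)` gives "reference" the highest score because several pages link to it, while "obscure" lands at the bottom because nothing links to it, no matter how good its content might be.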

40:48

One of those results is from a website dedicated to Paradise Lost, the literary work and has literary analysis on it, and it sits on a server that belongs to a prestigious university. Let's say that the second result is coming from a literary discussion site. It doesn't belong to a university, but it does have critical analysis and an entry specifically on Paradise Lost. And let's say that the third result is Billy Bob's Homespun Guide to Milton and Crab Trap

41:18

Maintenance or something. Now, the algorithm is not smart enough to actually read each of these sites as a human would and judge them and analyze them and weigh the value of each one, but it can see that the university server is, you know, it belongs to a university. It's generally treated as the property of a recognized authority, and so it sees that other reputable sites are also linking to that university's web pages, and to that Milton page in particular. So it assigns that result a very

41:52

high PageRank, saying it's probably pretty darn good. That also means it's going to appear higher on the list of search results. Meanwhile, Billy Bob's is likely to appear at the bottom of that list because very few people are linking to it. It might be hosted on just some server somewhere that happens to host a whole hodgepodge of different web pages, and the page that's on that site that has just sort of literary analysis discussions on it,

42:19

that one appears in the middle. Now, could Billy Bob's page actually be the best resource? Yes, it could be, but without a human or maybe a really incredibly advanced AI to review the contents of that page and to really understand them, the ranking approach seemed like the best way to quickly organize results to give the best chance that the returns were going to be relevant to the user. Now,

42:47

in that example I just gave, I mentioned three results. However, if you were to really perform that search, because I did it before I recorded this episode, you would get millions of results. In fact, just for a laugh, I went to Google typed in Paradise lost Milton Analysis, and I got quote about three point eight million results end quote,

43:10

and that happened in less than one second. PageRank becomes really important when you get to that level of response. When you get to that many results, that enormous amount of information, you really want the most relevant choices to be near the top to save yourself time. And that has created some pretty bad habits for us as users, by the way. We've become so used to search engines returning the most relevant results right at the top that we don't necessarily bother to look

43:42

beyond the first few sites. There are a lot of resources out there that have estimates on how many people actually bother to ever go past the first page of results, and some of them even say that as much as 95 percent of all web traffic will just go to results that appear on the first page for any given search, and that means that all the other results that appear after page one are sharing just five percent of the web traffic.

44:13

So when I did that Paradise Lost search, that first page of results had nine websites linked to it, plus a few videos. That means somewhere around 3.8 million web pages are sharing just the five percent of traffic that doesn't go to that first page. So they might include incredible resources that are even more relevant than the stuff that appears on page one, but very few people are going to that. That's one bad habit that we've

44:47

all developed through using these search engines. On the flip side, the message is that it's really important to get your page to show up on that first screen of results. If you're building a web page about a specific thing, you want to be on that first page, because otherwise you're gonna have to hope people find your website

45:07

through some other means, you know, outside of search. That gave birth to the industry of SEO, or search engine optimization, which is a constantly evolving set of practices that web designers try to follow in order to rank better in search. And whenever a search engine, which again these days we mostly just mean Google, whenever Google makes a change in its algorithm, it can really upset the apple cart, and it can push everyone back to the drawing board. It can completely jumble up who appears

45:39

at the top of search results. Now, all of that is another kettle of fish, so I'm going to leave off of SEO and go back to that on some other day. But more germane to our episode here is that the spiders, those web crawling bots, are what build out those indices that search engines use to

45:57

give us the results we ask for. There are some things I did not cover, such as tags that web developers can use to make sure that search engines just pass over their websites, or sometimes just pages within their websites, without adding them to an index, so they'll never show up in search. But we could go over that in a future episode, too. For now, it's kind of time to wrap things up. So guys, I hope you

46:22

enjoyed this episode. If you have suggestions for future topics, whether it's a specific technology, a trend in tech, a person in tech, maybe it's a company you want to know more about, let me know. Drop me a line on Facebook or Twitter. The handle for both of those is TechStuff HSW, and I'll talk to you again really soon. TechStuff is an iHeartRadio production. For more podcasts from iHeartRadio, visit the iHeartRadio app, Apple Podcasts, or wherever you

46:55

listen to your favorite shows.

Transcript source: Provided by creator in RSS feed
