Welcome to Tech Stuff, a production from iHeartRadio. Hey there, and welcome to Tech Stuff. I'm your host, Jonathan Strickland. I'm an executive producer with iHeart Podcasts. And how the tech are ya? So let's take a little literary trip. In Anthony Burgess's A Clockwork Orange, the extremely wicked protagonist, and that's putting it lightly, at one point early in
the novel reflects on the nature of permanence. He thinks the reader might not remember what milk bars were like due to, quote, things changing so skorry these days and everybody very quick to forget, newspapers not being read much
neither, end quote. Alex, in this case, is saying that the combination of the world changing very quickly, skorry is derived from a Slavic word meaning swiftly or quickly, and people having short memories means that referencing something that happened even just a few years ago might mean you're met with blank stares because the world has moved on. Now take that same sentiment and crank it up to eleven when you talk about the Internet in general and the
Web in particular. So, on the one hand, we know that the rule of thumb is that once something gets posted online, that's kind of it, right, it's sort of perpetually online. Like that's kind of the joke. Like once it's up, it's up, and you can take it down, but there's going to be a copy of it somewhere. So even if the originator tries to take down whatever the stuff was, somebody's got it. But on the other hand, we also know that so much stuff gets added every
single day to the Internet. There's actually a colossal mountain of content out there that just keeps getting bigger moment by moment, and everything that came before it can end up getting buried in the process. And sometimes stuff can be added and taken down without anyone being the wiser. Now,
on top of that, web pages obviously can change. A website might adopt a new format or style, might incorporate new technologies and interfaces that are added to web browsers, or it might choose to remove sections that once might have been relevant but maybe now not so much. Or entire websites could disappear as servers go offline or companies go bankrupt, or you know, web administrators just lose interest. The entire spectrum of human output can be found on
the web. Not every instance of human output, but an example of everything is out there. Everything from deep philosophical musings to the most banal posts, you know, the ones that often revolve around what someone is having for lunch. All of that finds its way to the Internet. And while you might argue that a lot of it, or perhaps even most of it, isn't really worth the time it
takes to consume, let alone keep around, there is undeniably a huge amount of valuable data out there too, but there's no guarantee that it will stay there or remain easily findable. And that's where today's topic comes in. I wanted to talk about a project that began back in nineteen ninety six. It's a project that aims to preserve as much of the Internet as possible in little
slices of time, little snapshots. Not only does that mean you can potentially dig up something that hasn't been online for years, but also you can get a look at what different sites were like in various eras of the Web. It could be a really eye opening experience to see something like Amazon and what it looked like, you know, shortly after it launched, compared to what it looks like today. So we are going to talk about the Internet Archive. Now.
To do that, we need to talk a little bit about the people who founded the ding dang darn thing, and that would be Brewster Kahle and Bruce Gilliat. So Kahle graduated from MIT with a degree in computer science and engineering. After he graduated, he joined fellow MIT graduate Danny Hillis, who had created a company called Thinking Machines. So this was a supercomputer company. The team specialized in building massively parallel computer systems, mostly with the aim
of building machines for AI research and development. So yeah, Kahle was working on the challenges of providing AI researchers with the compute power they need, decades before our current AI explosion. Bruce Gilliat is also a computer scientist, and that's just about all I know about him. I mean, I know he is, or at least was, married, and I also know he owned a series of very impressive houses in the San Francisco and San Jose areas, because it made the news whenever he sold one or bought
a new one. But other than that, there's precious little information about him that I could find, which is somewhat ironic when you consider that he has dedicated a lot of time and effort to preserving information on the Internet. He would go on to co-found a company called Alexa Internet with Brewster Kahle, but that's getting ahead of ourselves.
So most of my story will center around Kahle, simply because out of the two co-founders, he's the one who acted more as the face of the efforts, and Gilliat, from what I can tell, has just been really good about maintaining a very private personal life. So I don't mean to diminish Gilliat's contributions, but at the same time, you know, I can only cover what I can find. So in nineteen eighty nine, Kahle, along with a colleague named Harry Morris, created an innovative tool for
the blossoming Internet. Now remember this is the Internet. It's not the Worldwide Web. It didn't exist yet the Web the Internet did, and the tool they created was called the Wide Area Information Server or ways WAIS. So people could create a web server. They could host documents on their web servers. But finding these documents was really hard because you didn't necessarily have hyperlinks connecting one document to
others and vice versa. You didn't have an easy way of even navigating through different documents from one to the next. So it was almost the case that you needed to know where something was and what it was called first, and then you could go to the relevant server and retrieve that document. Otherwise the document would just remain quietly sitting on some server somewhere and no one would know
about it. Now, that is antithetical to the entire purpose of a wide area information sharing system, because, I mean, the name tells us the whole purpose of this technology is to allow information to be widely shared. Jeremy Norman's History of Information lists WAIS as quote the first Internet publishing system, just predating Gopher and the World Wide Web end quote. In a recorded presentation to some Xerox employees, Kahle laid out his personal perspective on what he wanted from
his experience on the Internet. So first up, he said he wanted his own personal information to be easily accessible by him. Specifically, not that it should be easily accessible to everybody, but specifically to him. He wanted the ability to get access to all the different stuff he generates, like articles and such, and to make it really easy to do that. He also wanted the ability for publishers
to get their work to him. So in Kahle's mind, the best approach would be for published works that are relevant to his interests to find their way to him, as opposed to Kahle having to go out and hunt down these published works himself. And he pointed out this is what publishers want too, because you wouldn't publish something unless you wanted folks to actually read it. He also said that he wanted this technology to be usable anywhere. He wanted people to be able to access it no
matter what kind of device they were relying on. Now he was specifically referencing laptops at the time, but he was also saying that portable computer systems, essentially things that would become smartphones and tablets, were on the horizon and that these needed to be able to access that stuff too.
And he said that he wanted people to be able to use what he had learned, should he choose to share the information. That is, if he had come up with something that was useful and he wanted to share it, he wanted other people to be able to access it. Kahle didn't say that people should be compelled to share, but if they wanted to, it should be possible to do so. WAIS was Kahle's attempt to bring these ideas to life. In that presentation to the Xerox employees, he
defined WAIS as electronic publishing. He further defined that term to mean the distribution of information. So whether the end user was to look at this information on a computer screen or they just chose to print out the information and then read it that way, that was beside the point. Electronic publishing was all about how information got from the originator to the end user. That's what made it e-
publishing: that it was publishing over wires. Now, one thing Kahle introduced in this presentation to Xerox was this concept of conducting searches using natural language. This concept is one that we're really familiar with today. You enter a query into a search bar. You describe what it is that you want to know or learn about, or have access to, or retrieve, or whatever. The search engine brings back search results that are ordered by some kind of relevance, depending
upon the search engine's, you know, various algorithms. How the search engine determines relevance really depends upon the system itself, of course. Like, you could run the same search across different search engines and get very different results based upon that methodology of determining relevance. What the system believes is relevant may or may not be relevant to what you actually want. Hopefully the two are aligned. If it's a really good search engine, then you're going to get
something that is actually meaningful to you. Anyway, WAIS was kind of following that approach back before there was a World Wide Web, you know, when you just needed a way to find stuff that was being stored on these Internet servers and to be able to retrieve these documents to make use of them. Otherwise you had this incredibly powerful communications tool, but it was so challenging to use in a meaningful way that the information stored there
would not be that useful. I think of it as akin to this: imagine that there's this one remote library, and it's tiny, but it has the world's only copy of some text. But this library is in the middle of nowhere, and it's really hard to get to. The fact that that library has that document would not be terribly useful to most people, and so it might as well not have the document
at all. That's kind of what WAIS was trying to do: solve this problem of making it easier to get access to this wealth of information that Kahle saw was only going to get more complex and more full of data. Well, we'll move away from WAIS, because we could do a full episode about that. I will say that Kahle and Morris, the founders of WAIS, the guys who created the WAIS technologies, would actually leave Thinking Machines, and they would found a spinoff company just called WAIS Incorporated.
And it was around this point that the mysterious Bruce Gilliat joined the team. And while the World Wide Web would debut in the early nineties, which really opened up accessibility to information on the Internet for a lot of people, most of them for the first time, WAIS would continue to remain relevant. In fact, it was relevant enough that in nineteen ninety five AOL would come calling with an offer to purchase the company for a cool fifteen million dollars.
If we adjust that for inflation to today's money, that would
be around thirty million bucks, in that ballpark. Now, a lot of the folks at WAIS Incorporated would split off to create new companies after this acquisition, and within a year that included Kahle and Gilliat, who went on to found a new company called Alexa Internet. And you might think, huh, Alexa, you mean like the same name as the Amazon digital assistant? And yes, exactly that, because, as it would turn out, Amazon would ultimately acquire Alexa Internet just a few years
after it was founded. But the name derived from the Library of Alexandria, the ancient library of Egypt that at one point housed one of the world's largest collections of accumulated knowledge. Now, around forty eight BCE, Julius Caesar, Julie Baby, and his boys barged into Alexandria, and as a consequence of their rowdy invasion, the library caught fire and much of the collection burned. Sadly, that was not the only indignity. In fact, it wasn't the first indignity
that the library suffered that would impact its relevance. Further conflicts a couple of centuries later pretty much wiped out whatever had been left from the previous calamities, and the Library of Alexandria became kind of a touchstone for folks who have stressed the importance of access to knowledge and the protection of that knowledge, and that the consequences that could follow from the loss of such knowledge can be
really dire. See also the Middle Ages, the Dark Ages, for example. That loss of knowledge is a really terrible thing.
So the impetus for Alexa Internet was that Kahle and Gilliat wanted, in the words of the Web Design Museum, quote, to develop advanced web navigation that would continually improve itself on the basis of user generated data, end quote, which is a pretty advanced idea for nineteen ninety six, when the Web was still very young and the general public was still just trying to get a grip on exactly what the Web and, by extension, the Internet were. One of the first tools that Alexa Internet developed was a
browser toolbar. So installing this toolbar into a browser would give users access to a sort of crowd-powered recommendation engine. In some ways, it's not that different from sites like Digg and Reddit that would later rely on the user community to actually work and to recommend links to really interesting sites. This toolbar would recommend sites to users based upon how the overall community was browsing.
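Just to make that crowd-powered idea concrete, here's a toy sketch in Python. To be clear, this is not Alexa Internet's actual algorithm, which as far as I can tell was never published in that kind of detail; it's just a simple "people who visited this site also visited these sites" counter, with made-up site names.

```python
from collections import Counter, defaultdict

# Toy illustration of a crowd-powered recommender, in the spirit of the Alexa
# toolbar described above. NOT Alexa Internet's actual algorithm -- just a
# simple "people who visited this site also visited these sites" counter.

# Each entry is one hypothetical user's browsing session (made-up site names).
sessions = [
    ["news.example.com", "weather.example.com", "sports.example.com"],
    ["news.example.com", "sports.example.com"],
    ["recipes.example.com", "news.example.com", "weather.example.com"],
]

# Count how often every pair of sites shows up in the same session.
co_visits = defaultdict(Counter)
for session in sessions:
    unique_sites = set(session)
    for site in unique_sites:
        for other in unique_sites:
            if other != site:
                co_visits[site][other] += 1

def recommend(current_site, top_n=3):
    """Suggest the sites most often visited alongside current_site."""
    return [site for site, _ in co_visits[current_site].most_common(top_n)]

print(recommend("news.example.com"))
# e.g. ['weather.example.com', 'sports.example.com', 'recipes.example.com']
```

Feeding more sessions into something like that is basically the "continually improve itself on the basis of user generated data" idea from that Web Design Museum quote.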
So the more people who were using this toolbar, the more information was coming in about where they were going, and thus you would get different recommendations. So if a lot of people were navigating to a specific site for whatever reason, you might get a recommendation to do the same. It was an attempt at an organic way for folks to suggest websites, kind of like a word of mouth campaign. And Alexa Internet would also provide meta information about websites
to users if they wanted it. Meta information is information about information, so this would include stuff like how many web pages were part of an overall website, or how many other websites were pointing back to the one you were on, and so forth. A lot of the stuff that Alexa Internet could tell you would reflect a specific web page's relevance. It's the same sort of information that search engines like Google would take into account when deciding
relevance for search results. And that meant that it didn't take very long for Amazon to come around with an offer to purchase Alexa Internet. I'll talk about that more, as well as the development of the Internet Archive after we come back from this quick break to thank our sponsors. So Amazon in nineteen ninety nine takes a look at
Alexa Internet and says, Wow, this is pretty incredible. This little company has created some means of checking for stuff like relevance and metadata that could be really, really useful for us. And so Amazon made an offer that Alexa Internet couldn't refuse, to acquire the company for the sum of two hundred and fifty million dollars in
Amazon stock in May of ninety nine. So this is a little different than the earlier deal we talked about, where AOL bought, you know, WAIS Incorporated, because Amazon bought it with two hundred and fifty million dollars' worth of stock. If we just treated that like it was a cash exchange, then if we adjust for inflation, that's like around four hundred and sixty nine million dollars' worth of stock. But that's not really how you deal
with the value here, right. You have to think about how much was the stock worth back in nineteen ninety nine versus how much is the stock worth today? I checked, and I saw that in May of nineteen ninety nine, Amazon stock was trading for around two dollars eighty nine cents per share. These days, it's closer to one hundred and eighty dollars per share. Plus, between then and now, Amazon had two different stock splits. There was a two to one split in late ninety nine, and there was a
twenty to one stock split in twenty twenty two. When you factor all that in, that two hundred and fifty million dollars in stock ends up being a ton of wealth. Like, it's just a huge amount. It would take a lot of calculating to get an estimate, and even then it wouldn't really be accurate. Just say that deal is worth a lot. So anyway, the important thing with the Internet Archive is that Kahle and Gilliat, through their work in creating tools for Alexa Internet, found themselves able to
create snapshots of the Web. So they were using Alexa Internet to run a commercial business, and they established the Internet Archive as a way of preserving information that had, at some point or another, found its home on the Internet. So they were using Alexa Internet tech to crawl the young Web in order to index everything, which is a necessary step if you want to give people access to the various documents posted on the web. You first have to know what is there and where it is. To
do that, you've got to index everything. And then they said, well, now that we are able to index this, we could
actually download these little snapshots and keep them. And according to the Internet Archive, that would be important, because the average lifespan for a new web page was not very long. So contrary to our belief that once something is posted to the Internet it's there forever, the archive found that, on average, new web pages stuck around for about seventy seven days, which is less than three months, and then poof, they would disappear. Like, maybe they would change drastically,
maybe they would just go away. Now, imagine that you were to walk into a brick and mortar library, but then you found out that on average the books in that library would only stick around for three months before being lost forever. And think of all the knowledge that would disappear on a regular and ongoing basis. It would be impossible to calculate the impact of that kind of reality. It would be like losing the Library of
Alexandria regularly, every three months. So Kahle had come to the conclusion that knowledge should be preserved and made available for posterity. This is similar to an idea that was proposed by Stewart Brand back in the nineteen eighties. It's a complicated idea that typically gets boiled down to the saying information wants to be free. That's actually an oversimplification of what Brand was really communicating. But his point was
that information's value is kind of like a paradox. The information could be incredibly valuable, right, it could be absolutely critical, and therefore it could be expensive, but the cost of distributing information was consistently declining. It was getting easier and cheaper to share information, and the benefits of making information accessible are typically pretty tremendous. But information is only accessible if someone is able to hold onto that info. Otherwise
it's lost. Right, The Internet was such a volatile thing that there was no guarantee that what you saw today would be available tomorrow. In the days before the dynamic web, it wasn't really unusual for someone to establish a web page, to publish that page, and then later on to wipe the slate clean or you know, otherwise alter vast portions of that page in order to use that same web landscape to host a totally different document. So the old
stuff would just disappear. And so Kahle and Gilliat created the Internet Archive, a nonprofit organization dedicated to the archival of information across the Internet. And I think most people are familiar with it from the Wayback Machine, but that's just one part of what the Internet Archive does. As stated by the Library of Congress, the mission of the Internet Archive was quote offering permanent access for researchers, historians, and scholars to historical collections that exist in
digital format end quote. Kahle and Gilliat founded the Internet Archive the same year they founded Alexa Internet. So that's nineteen ninety six. And it wasn't easy. And why is that? Well, you've got to think about the challenge you face if you want to archive everything on the Internet, or at least everything that you're allowed to archive on the Internet. We'll come back to that a couple of times. So, for one thing, you need to create a way to capture the content of a web page and to preserve
that for posterity. And you need a way for people to access those archived web pages and to navigate them. So Alexa Internet would end up developing these technologies and commercializing them in various ways, and the Internet Archive was made possible through these tools. So you could think of Alexa Internet as being the funding machine for Internet Archive in the beginning, at least as far as the tools
Internet Archive would use in order to achieve its mission. Now, on the capturing front, Alexa Internet created a web crawler. So for applications like web search engines, primarily web search engines, web crawlers are the soldiers that they send out. A web crawler's job is to index content across the Internet and to capture information about what the various web pages
on the Internet are actually about. It's complicated, right. You could just have a directory of web pages that's based off the title of the web pages, but title and content are not always in alignment. So web crawlers are all about following the various branching pathways across the web. They crawl through the web, in other words, indexing every page as they do so. Not everyone, however, wants their
web page indexed. So you can actually include some HTML language in your web page that indicates that it's off limits for indexing, and polite web crawlers, such as the ones that Alexa Internet was using, will honor those instructions and will not index that page. But other pages that lack this specific instruction of hey, don't index this, they're fair game. I like to think of web crawlers kind of like Doctor Strange from the Marvel Universe, the
Cinematic Universe in particular. He uses his time manipulation abilities to see where all the different possible pathways can lead. The web crawlers do that across the web. They explore all the nooks and crannies. They follow each link, even the ones that no one ever clicks on; they follow those too. And you know, hats off to web crawlers for doing that to build out these indices, because without it, web search wouldn't work,
and Alexa Internet wouldn't have been a thing anyway. Alexa Internet and, by extension, the Internet Archive used several different web crawlers over the years, but they all basically do the same thing, or, more accurately, they all aimed to achieve the same results. So the crawler starts with seed URLs. This is like the starting point where you let them go, and then they follow each link and they download documents to the archive's servers.
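If you want a rough feel for what that looks like in code, here's a minimal, heavily simplified crawler sketch in Python, using only the standard library. The seed URL is just a placeholder, and real archive crawlers are vastly more sophisticated about rate limits, content types, and error handling; this just shows the seed-URL, follow-the-links, download-and-remember pattern, including a robots.txt check so it stays polite.

```python
from collections import deque
from urllib import request, robotparser
from urllib.parse import urljoin
from html.parser import HTMLParser

# A minimal, heavily simplified crawler sketch. Real archive crawlers are far
# more careful (rate limits, content types, meta noindex tags, error handling).

class LinkExtractor(HTMLParser):
    """Collect the href targets of <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def allowed(url):
    """Check the site's robots.txt -- the 'polite crawler' part."""
    rp = robotparser.RobotFileParser()
    rp.set_url(urljoin(url, "/robots.txt"))
    try:
        rp.read()
    except OSError:
        return True  # no robots.txt reachable; treat as allowed in this toy
    return rp.can_fetch("toy-archiver", url)

def crawl(seed_urls, max_pages=10):
    frontier = deque(seed_urls)   # start from the seed URLs
    visited = set()               # remember what we already grabbed
    snapshots = {}                # url -> downloaded bytes (the "snapshot")
    while frontier and len(snapshots) < max_pages:
        url = frontier.popleft()
        if url in visited or not allowed(url):
            continue
        visited.add(url)
        try:
            with request.urlopen(url, timeout=10) as resp:
                body = resp.read()
        except OSError:
            continue
        snapshots[url] = body
        extractor = LinkExtractor()
        extractor.feed(body.decode("utf-8", errors="replace"))
        for link in extractor.links:
            frontier.append(urljoin(url, link))  # follow each link we find
    return snapshots

# Hypothetical seed list; point it at something you control before running.
pages = crawl(["https://example.com/"])
print(sorted(pages))
```

That visited set is also doing the "don't double dip" cross-referencing that comes up next: if ten different pages all link to the same document, it still only gets fetched once per crawl.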
The crawlers also reference the links to ensure that they're
not double dipping on a specific crawl. So if you have a ton of different sites that are all linking to the same document, like, let's say that someone has published something, and hundreds of other resources on the Internet reference that published document, well, that means there's all these different pathways that lead to the same destination, right, and it would be somewhat wasteful to capture this exact same document multiple times during the same crawl, so there's cross
referencing that happens in order to prevent that from occurring. This process does work, but it also has limitations. So for one thing, these crawls, they do create snapshots of the web at intervals. So if you use the Wayback Machine, and we'll talk more about that in a second, you'll see that the history of a web page consists of a series of dates, from when the Internet Archive first received a snapshot of that page all the way up to the most recent reference of that page,
the most recent snapshot. The various dates in the Wayback Machine are not necessarily relevant to any major changes that happened on the web page itself. This is just when the web crawlers went to that particular web page. So it may be immediately after a massive change has been implemented, it may be well after. In fact, there might be a point where, between web crawler visits, a web page has
changed a couple of times. Well, that means that the versions that existed in between those visits aren't going to be captured. It's just whatever was there the first time the web crawler came through, and whatever was there
the next time the web crawler came through. So the interesting thing is that if a particular page does have a ton of other links pointing to it, that page is more likely to have very frequent snapshots throughout its history, because, again, through subsequent crawls, there are various routes that take web crawlers through that web page, so they're more
likely to capture a snapshot of it. For pages that have fewer links pointing to them, where maybe there aren't that many other web pages out there that cite this particular page, they're more likely to have sporadic updates throughout their history. You might pull up a page in the Wayback Machine and see that there's only maybe half a dozen captures of that particular page, and that means that there could be a lot of changes that were missed in between visits.
So not everything gets captured in the Internet Archive. I think that some people work under the mistaken presumption that anything that was ever published to the web is captured and archived there. That's not the case. It's whatever was there when the web crawlers came through. So, because even the Internet Archive is not a perfect record of everything that's ever happened on the web, other elements, like I said, could also be lost to time due to
the complexity of web navigation, for example. So when web designers started to incorporate things like Flash, which really is no longer a thing but it was for a while, or JavaScript, then the web crawlers that were being used to index the web, a lot of them just couldn't navigate these types of tools that were made through Flash or JavaScript. So while human users could, and they could, you know, interact with interfaces that were created
through these various methods, web crawlers couldn't. And that meant that if a website used, like, tools that were made in JavaScript to act as the interface, the web crawler might only be able to index the homepage, but not any of the other links branching off from the homepage, because it couldn't navigate that same interface. So there's a lot of stuff from that era that's lost to the Internet Archive as well, simply because the crawlers just could
not navigate those pages. They were never captured. And like I said, if you happen to have the instruction, the HTML instruction not to index the site, well then that's not going to be there either. Now let's move on to another challenge, which is the storing of these files. Indexing everything was one thing. How do you store everything that can be indexed on the web in an archive? That's what we're going to come back and explore after
we take another quick break to thank our sponsors. Okay, so the Internet Archive. How do you store all the information that you find across the web? Well, the big one for web pages was that you had to figure out where do you store and how do you organize snapshots of the web so that, one, you have a record of them, and two, you can find what you're looking for. You can navigate to the specific instance that you're looking for. Keep in mind, again, the archive's not
capturing everything. As I said before the break, there's a lot of stuff that web crawlers could not access for one reason or another. Those things would be either off limits or inaccessible and thus would not be in the archive. But everything else was still fair game. So to store and organize everything, Alexa Internet created a new file format called an ARC file. ARC files contain information about all the stuff that's inside them, the metadata of the Internet.
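To give you a rough sense of the idea, here's a little Python sketch of a self-describing record in that spirit. This is not the actual ARC file format, I'm not reproducing the real spec here; it's just the principle that each stored document carries its own little header, things like the URL, when it was retrieved, and how big it is, so you don't need a separate index to know what you're looking at.

```python
import datetime
import io

# Toy illustration of a self-describing archive record, in the spirit of the
# ARC idea described here. This is NOT the real ARC specification -- just the
# principle: each record starts with a small header that says what the
# document is, how big it is, and when it was retrieved.

def write_record(archive, url, payload):
    """Append one self-describing record: a header line, then the bytes."""
    timestamp = datetime.datetime.now(datetime.timezone.utc).isoformat()
    header = f"{url} {timestamp} {len(payload)}\n"
    archive.write(header.encode("utf-8"))
    archive.write(payload)
    archive.write(b"\n")

def read_records(archive):
    """Walk the archive, yielding (url, timestamp, payload) for each record."""
    while True:
        header = archive.readline()
        if not header:
            break  # end of archive
        url, timestamp, length = header.decode("utf-8").split()
        payload = archive.read(int(length))
        archive.readline()  # consume the trailing newline
        yield url, timestamp, payload

# Demo: write two records into an in-memory "archive", then read them back.
buf = io.BytesIO()
write_record(buf, "http://example.com/", b"<html>hello</html>")
write_record(buf, "http://example.com/about", b"<html>about us</html>")
buf.seek(0)
for url, ts, body in read_records(buf):
    print(url, ts, len(body), "bytes")
```

Real ARC files then bundle a whole bunch of records like that together, which is where the one hundred megabyte capacity I'm about to mention comes in.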
So again, metadata is data about data. It makes the small files inside the larger ARC files all self-identifying, so there's no need to actually build out an index. The self-identifying information includes stuff like the URL for the file, like what the URL for that particular document is, how big the document is, when it was retrieved, and other stuff like that. Each ARC file would have a capacity of around one hundred megabytes, and it was possible
for a single website to span multiple ARC files. I mean, there are some big websites out there that have been around for a long time, so yeah, sometimes a single ARC file would just be a portion of that website. At first, the Internet Archive stored all this information on magnetic tape. So you would do this indexing of the web, all these snapshots, and you would save it to magnetic tape. I remember I used to work for a company, a
consulting firm that had magnetic tape backups. So it was my job, one of my jobs to occasionally back up all the data on our network to tape, and I would have to swap tapes out and label them and everything and archive them properly. The Internet Archive worked under the same idea. It would capture a snapshot of all the files across the web, save them to tape, and that was how the Internet Archive kept track of things
for about three years. But eventually activity on the Internet was such that that was not going to do it. There were too many users who wanted to be able to access things that were stored or saved within the Internet Archive, and this method just couldn't keep up with demand. And necessity, as we all know, is the mother of invention. So the Internet Archive needed an alternative way
to store these snapshots. And of course, the Web was really growing dramatically, which is putting it lightly, and there was a real need to step things up considerably. So to that end, the staff at the Internet Archive developed a storage system they called the PetaBox, and it was called the PetaBox because it could house a petabyte of information.
A petabyte, in case you're curious, is a million gigabytes. Now, the most recent data I have about the PetaBox storage system actually comes from December twenty twenty one, so it's a few years out of date. But at that time, the Internet Archive was using two hundred and twelve petabytes of storage, which is a lot. That wasn't all the Wayback Machine, however; only around fifty seven petabytes of that
was for the Wayback Machine. The rest was for other things, like archiving various forms of digital media, as well as what the Internet Archive references as quote unquote unique data. Anyway, the page on the Internet Archive's site says that the data centers, there are four of them, that house the PetaBox storage system don't use air conditioning, which helps keep electric
bills down. They actually let the heat from the data storage devices provide heating for the buildings that they're stored in, and, you know, this is all part of a strategy to keep things at low cost but high usability and high efficiency. So those are really the big requirements for the PetaBox system. It has to be efficient. It cannot require too much power to operate any single PetaBox. Another requirement is that each rack of hard drive storage
has to hold a ton of hard drives. We're talking like one hundred plus terabytes worth of hard drive space. Another requirement is that, to serve as an administrator, it needs to be easy. Like, it can't be complicated to administer this storage system, and according to the Internet Archive, the structure of this is such that you need about one administrator for every petabyte worth of data, so you know,
that's like two hundred administrators. Essentially, the whole goal was to create systems that were relatively inexpensive, relatively efficient, and relatively easy to use, at least from an administrative perspective. That's a really tall order. It's hard to meet all of those, but the folks at the Internet Archive made it happen, and it was such a useful approach to storage, and to being able to organize the files within storage so that you didn't have to build out indices, that ultimately the Internet
Archive would deploy this same strategy for other organizations and institutions. Okay, but that's all about, you know, collecting and storing all the information across the Internet. How do you access it? How, as a user or as a researcher, are you able to tap into this? Because again, unless accessibility is easy, then there's not much point to doing this. You're just
making a record that nobody can reference. Well, I would argue the most famous of the ways to access information contained within the Internet Archive is the Wayback Machine, which is specifically for web pages. The Internet Archive first introduced the Wayback Machine in two thousand and one, and the way it works is pretty simple. There's a little, it's kind of like a search bar, but it's a URL bar.
You put in a URL for the web page that you're interested in, and the Wayback Machine pulls up the snapshots that are contained within the archive, if there are any snapshots. As I mentioned earlier, not everything is in there, but if it is in there, you will see options available to you to look at the page at different points in history. One thing I like to do is look back at how famous web pages have changed in
their design over the years. If you put in something really big like CNN dot com, you can see how the look and interface of that site has transitioned during different eras across the web. I also used to do this with the old website I worked for, HowStuffWorks dot com. I mean, that's where Tech Stuff gets the Stuff in its name, from HowStuffWorks dot com. I like using the Wayback Machine to look at what the site looked like when I first joined, which was
in February two thousand and seven, in case you're curious. It looks entirely different now than how it looked back then, and through the Wayback Machine you can see what it looked like back then. Also, these days, the Wayback Machine is the only way I can see some of the articles I wrote for that site, because the articles have been either deleted or, more likely, rewritten over time.
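Incidentally, you don't even have to click around the site to do that kind of lookup. The Internet Archive exposes a public Wayback availability endpoint at archive.org/wayback/available that, given a URL and a date, returns the closest archived snapshot it has. Here's a quick Python sketch; I'm going from memory on the exact shape of the JSON it sends back, so treat the field names as approximate.

```python
import json
import urllib.parse
import urllib.request

# Query the Wayback Machine's public availability endpoint for the snapshot
# closest in time to a given date. Field names are from memory -- check the
# live response before relying on this for anything serious.

def closest_snapshot(url, timestamp):
    """Return the archived snapshot of `url` nearest to `timestamp` (YYYYMMDD)."""
    query = urllib.parse.urlencode({"url": url, "timestamp": timestamp})
    endpoint = f"https://archive.org/wayback/available?{query}"
    with urllib.request.urlopen(endpoint, timeout=10) as resp:
        data = json.load(resp)
    return data.get("archived_snapshots", {}).get("closest")

# The example from the episode: HowStuffWorks around February 19, 2007.
snap = closest_snapshot("howstuffworks.com", "20070219")
if snap:
    print(snap["timestamp"], snap["url"])
```

That "closest" behavior is also part of why the links between archived pages work the way I'm about to describe: click a link inside a snapshot, and the Wayback Machine serves up whatever capture of the target page is nearest in time to the one you're on.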
Now, to be fair to HowStuffWorks, a lot of my writing was in the computers and electronics sections, and obviously things change in those fields very quickly, and something that was relevant fifteen years ago is definitely not relevant today. So you have to replace old stuff on a regular basis. But it is kind of sad that a lot of my work, a lot of my work for the first, you know, ten years of my career doing this kind of stuff, is not accessible unless you use something like
the Wayback Machine. Now, one super neat thing about the Wayback Machine is that you can still follow links that are on pages. Like, if the archive has those linked assets in the archive as well, then you're going to be shown a record, and the record will be the one that was captured closest in time to the first page that you were originally on. This sounds complicated; let me give
an example, it makes it way easier. So let's say that I visit the web capture, the snapshot, for HowStuffWorks dot com's homepage on February nineteenth, two thousand and seven. By the way, this snapshot on February nineteenth, two thousand and seven is the closest date in the archive to when I started working at that company. The website was not captured on the actual date when I started. Anyway, by clicking around on this homepage, I can actually follow links and it'll pull up archived versions of articles, which is really neat. And when I did that, at one point, I clicked on a link for more information, or related articles, about how helicopters work. That page, the related page, was actually archived on February twenty second, two thousand and seven. So one was on February nineteenth, the other was February twenty second, but the
link still worked, right? Yes, these were two different pages that were archived on two different days, but the nature of the archive allows those links to still work between the two, which is neat, because I'm not just popping around through a web of links. I'm also kind of time traveling, right? I'm looking at a timeline of snapshots that are all still interlinked together, even if they were
captured on different days. I think that's really cool. Now it gets even more cool when you think about the
scale of this project. So, according to the Internet Archive itself, the archive contains eight hundred and thirty five billion with a B web pages, And as I mentioned earlier, that just makes up part of all the data that's stored on Internet Archive servers, because the organization is also home to more than forty four million books and other texts, fifteen million audio recordings, more than ten million videos, and more than a million different pieces of software. Again, some
of this stuff might not be recorded anywhere else. There may not be duplicates or copies of some of this stuff anywhere else. While you might have things like Blu ray DVDs or whatever of some of those videos, others might not have anything. And history is filled with instances of media companies generating stuff or others, you know, independent people too, generating stuff but not keeping a copy for posterity, and then it's here and it's gone. Sometimes that's on purpose.
Sometimes it's a statement, like you make something ephemeral for that very reason. Other times it's out of convenience, Like there are stories about how the BBC would regularly reuse tapes and tape over previous programming because there was no thought about preservation or a home theater industry. So there are entire eras of stuff like Doctor Who that are just gone or believed to be gone because the BBC would just tape over old tapes and so you lost
whatever was on there originally. That's why things like the Internet Archive exist: to avoid that in the case of stuff that's stored across the Internet, to make sure that there is an accessible record of those things and that they don't just disappear. In two thousand and seven, the state of California recognized the Internet Archive as an official library, which was important. It's not just an honorary title.
It would allow the nonprofit organization to receive federal funding, which is a pretty important development for the longevity of the program. But while the usefulness of the organization is beyond question, the methods that the Archive has used have not always been met with universal approval. For example, recently, the Internet Archive has been embroiled in a pretty nasty lawsuit.
It's called the Hachette versus Internet Archive suit, and it revolves around a group of publishers that object to how the Internet Archive scans physical books for the purposes of lending them out as digital copies. Publishers are in the business of publishing and selling copies of books, but for years, libraries have existed in order to get copies of various
books and to make them available for lending. So libraries have to purchase the books or have them donated to the library, and then they make those books available to lend out to members of the library. The Internet Archive has a controlled digital lending program to handle this sort of thing, only we're talking about digital formats, not a physical copy
of a book. This is where things get tricky, because obviously, if you, as an American citizen at least, go out and buy a copy of a book, you can do whatever you like with your copy of that book, apart from making your own copies of it and then selling those. You can't do that. That's copyright infringement. But if you own a physical copy of a book, you can. You can keep it for yourself. You could lend it to a friend and let them read it, they return
it to you later. You could give the book away. You could resell your copy to someone else, even if you're selling it for a fraction of what the book is going for in bookstores. You could do that. You could even burn the darn thing if you're so inclined. Just don't do that. Don't burn books. But all of those things are permitted with your personal copy of the book. However, a digital copy, well, now we're starting to talk about different rules. So yes, you can lend out a physical
copy of a book. That's allowed. That's fair use. But actually it's not even fair use. That's under laws of property. But we won't get into all that. A digital copy is a lot trickier because it's easy to replicate, much easier than replicating a physical copy of a book, and so different rules have developed to handle digital information compared
to stuff that's in our physical meat space. So this lawsuit argues that the Internet Archive first digitized physical books without permission from the publishers, and that that was problem number one. There's been some different arguments about that, like if there was no ebook equivalent of the copy of the book, if the publishers had not digitized that, that's slightly different than if the publishers also offer an electronic
version of the physical books they sell. But the other problem is that the Internet Archive received donations and funding that in part stemmed from the practice of lending out digitized books, so the publishers said that made the Internet Archive's activities a commercial enterprise. In twenty twenty three, a judge found in favor of the publishers, saying that the Internet Archive failed to argue that their work fell under
the principles of fair use. Again, getting into fair use, that's a whole thing, but generally speaking, fair use covers a relatively narrow set of use cases in which the copying or the use or the distribution of a copyrighted work does not count as copyright infringement. But it has to meet certain criteria, and it's only ever decided in a court of law. It's not something that you
can just apply for proactively. It's something that you use in a defense if you're brought up on charges of copyright infringement. So by the time you're actually talking fair use, it's already pretty late in the game. But anyway, this particular lawsuit is under appeal. The Internet Archive recently made final arguments in the case. I have not seen anything about the case being decided one way or the other since then, so I'm not really sure which way it's going. Again,
I didn't see anything about a decision made, but then most of the articles about this are about the initial trial that happened in twenty twenty three, so hopefully I will find some follow up on this at some point. But there's no denying the Internet Archive has done a tremendous amount of work in the field of knowledge preservation and knowledge accessibility. Without the Internet Archive, there's no way of knowing how much information would be lost to us forever.
Stuff that could have been incredibly useful or even just diverting could be gone, and we'd never have a way of retrieving it again. And I am very thankful that an organization like the Internet Archive exists. If you're not familiar with it, if you never used it, I recommend you check it out and explore the Internet Archive. Look at some of the things that are in that archive, like some of the books, some of the recordings. There's
some great stuff. I think there's like a quarter of a million live performances archived just on the Internet Archive, like live music performances. That alone is super cool. Anyway, I hope you found this episode informative and entertaining. I hope you check out Internet archive. I also very much hope that you are all well and I will talk to you again really soon. Tech Stuff is an iHeartRadio production.
For more podcasts from iHeartRadio, visit the iHeartRadio app, Apple Podcasts, or wherever you listen to your favorite shows.