
A Small Episode About Big Data

Feb 05, 2024 · 39 min

Episode description

What does Big Data actually mean? How has the science of Big Data changed recently? What are the potential benefits and pitfalls of Big Data? 


Transcript

Speaker 1

Welcome to TechStuff, a production from iHeartRadio. Hey there, and welcome to TechStuff. I'm your host, Jonathan Strickland. I'm an executive producer with iHeart Podcasts, and how the tech are you? So early on in the days of TechStuff, back when I was still a staff writer for a little website called HowStuffWorks dot com, my boss Conal Byrne, who is now a big shot over here at iHeart, came over to me with an assignment. He wanted me to do some articles and some episodes

about this buzzword concept called big data. And I had heard the term big data, and obviously there's a pretty darn good hint at what big data is all about just in the nature of the name itself, but beyond that, I didn't really know much, so I jumped to it. And the interesting thing is that since that time, the

discipline of big data has evolved significantly. When I was first working on my articles and episodes, we were mostly talking about how technological tools made it easier to collect vast amounts of information very quickly and to store it. But we didn't necessarily have equally sufficient tools to do anything useful with all that information. Or at least, those tools weren't widely known and understood beyond a

certain circle of computer scientists. Flash forward a few years, and we'd see companies developing new methods to analyze large chunks of data. Oh, by the way, I do the weird data/data pronunciation thing, and there's no rhyme or reason to it. I don't even know which one I'm going to say before I say it, so I apologize because

I know it's irritating. It irritates me too. Anyway, other companies sprang up with products that were meant to help with data analysis, and it seemed like we were going from an era of, well, now I have all this information, what do I do now, to an era of, I have discovered cryptic secrets that were hiding in plain sight thanks to data analysis, and that somehow it all happened overnight.

So today I thought we would actually look back over the history of the big data concept, how various systems have made it possible to sift through seemingly meaningless information in order to find nuggets of wisdom, and why we might not always be able to trust the answers that we discover. So the history of big data starts in the twenty tens, or maybe it starts in two thousand and five, or maybe in nineteen ninety, or maybe the

sixteen hundreds, or maybe nearly twenty thousand years ago. You might have already picked up on the fact that folks don't quite agree on where we should start when talking about big data. But that makes sense. Ever since humans have started to write stuff down, we've been pretty darn invested in the collection and then the classification of information.

Whether it's to figure out the best time to sow or harvest crops, or keep track of how much we've traded with that other band of ne'er-do-wells who live on the other side of the holler, or we just want to make a record of how great it was that we kicked the butt of that mastodon real good, we've been really obsessed with data collection and retrieval. Now, this obsession also means that we had to come up

with various ways to store and analyze this information. Raw information doesn't do anyone much good, and so throughout antiquity we came up with means of recording and storing and making use of information. Not only did hardworking humans create libraries where we could gather all this knowledge and then lose some of those libraries along the way due to the fact that we humans also are pretty stupid and we end up having disputes that involve burning each other's

stuff to the ground. Yeah, I'm still bitter about certain libraries being destroyed over in antiquity, but it means that we also had to come up with methodologies to categorize and classify information. Otherwise you may as well just have a big old pile of scrolls or books or whatever, and then people just you know, have to sort through them and see if they can find anything, which actually

sparks two different memories in my head. One is that there used to be a used bookstore I would go to here in Atlanta, and often the used bookstore was completely unorganized, right? Like, you literally could go through a bookshelf and it's just going to be books that are more or less the same size, but otherwise there's no rhyme or reason as to why they were put there, and it was like you were on a treasure hunt. And then I'm also reminded of a naval museum in Apalachicola, Florida,

which is on the Panhandle. I went to this little, you know, naval museum, like a ship museum, and I remember that all the exhibits were kind of in a pile on the floor, and you would literally pick things up and look at them. And that's kind of what it would be like if we didn't have these means of classification. Once you get to a certain size, like that little museum in Apalachicola wasn't so big as

to be a problem. But if you're talking about a big library, obviously, if you want anything useful, you've got to come up with a way of classifying all this. To that end, ancient folks began to develop a science called taxonomy. And this isn't when you stuff dead animals so that they look like they might still sort of be alive. That's taxidermy. No, taxonomy is the science of classification, and it's perhaps best known in the field of biology, thanks in large part to a Swedish scientist

from the eighteenth century named Carl Linnaeus. But there are many applications of taxonomy that extend beyond biology. It's just that biological taxonomy is the one that I think most of us are familiar with, because most of us were taught it when we were going through basic biology. But the ancient Greeks made some early progress on developing systems of classification, and obviously, within modern library science, taxonomy is an important discipline, though oddly enough, you could say

taxonomy in library science is distinct from classification. When I was looking this up, I found resources for library science that made these two distinct disciplines. Classification was one and taxonomy was another. Now, this is because there are various methods of classification in library science. The one that I was most familiar with when I was growing up was the Dewey Decimal System, which I don't even think is the dominant form now, but it was when I was

growing up. And it's meant to connect a specific work to a specific physical location in a library for the purposes of, you know, tracking down the book, right? But taxonomy in library science tends to be more towards metadata, or data about data. In fact, metadata plays a huge part in big data. Oh man, I did it both ways in one sentence. I feel awful. Anyway, the information about information can be as useful as the information itself,

in some cases. I have often talked about this with personal information, about how info about info can give you a lot of insight into a person. Maybe you don't have a person's name, but you have a couple of different data points about that person. In some cases, you can actually narrow down the identity of the person you're thinking of just by looking at this metadata. You don't even have to see the information about them, which shows

you how powerful metadata can be. So you start to see a cascading effect here, where you slowly realize that you actually have access to even more information than you first anticipated, because you also have information about that information. It gets pretty wild.
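To make that concrete, here's a toy sketch in Python with invented records. None of this is real data; the fields are just hypothetical examples of the kind of metadata that piles up around a person.

```python
# Hypothetical records: no names, just three "harmless" metadata fields.
people = [
    {"zip": "30308", "birth_year": 1976, "job": "teacher"},
    {"zip": "30308", "birth_year": 1976, "job": "podcaster"},
    {"zip": "30308", "birth_year": 1983, "job": "podcaster"},
    {"zip": "30341", "birth_year": 1976, "job": "podcaster"},
]

# Any single clue matches several people...
podcasters = [p for p in people if p["job"] == "podcaster"]
print(len(podcasters))  # 3

# ...but combining just three data points narrows it to exactly one person,
# even though no record ever contained a name.
match = [p for p in people
         if p["zip"] == "30308" and p["birth_year"] == 1976
         and p["job"] == "podcaster"]
print(len(match))  # 1
```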

Another important development in the history of big data is the creation of statistics. So let's give the Merriam-Webster definition of statistics, shall we, just to have a baseline. It is, quote, a branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of numerical data, end quote. Now, one famous early example of statistics comes to us courtesy of a fellow named John Graunt. He was looking at mortality rates in London, and that gave him a lot more information and helped him analyze the course of the plague. For example, he

could see when the plague was spiking or receding. Pretty cheerful stuff, right? But he also used this information, the mortality information, to start drawing some conclusions about the population of London as a whole. Counting up everybody, like actually figuring out who lives in London, would have been

challenging at the time, to say the least. But Graunt took information like the number of funerals and then he compared it to things like the average family size in London to try and make an estimate of London's population. So it gave him kind of a working figure that was useful for certain applications, specifically government ones.
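To give a flavor of that kind of estimate, here's a hypothetical version of the arithmetic in Python. The numbers are invented for illustration, not Graunt's actual figures; the structure of the estimate is the point, not the values.

```python
# Invented figures for illustration only; not Graunt's actual numbers.
burials_per_year = 13_000       # tallied from the Bills of Mortality
annual_deaths_per_family = 0.4  # assumed average deaths per family per year
persons_per_family = 8          # assumed average household size

families = burials_per_year / annual_deaths_per_family  # 32,500 families
population = families * persons_per_family              # ~260,000 people
print(int(population))  # 260000
```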

Statistics as a branch of mathematics would mature over the following centuries. Often it would be the tool that allowed social scientists to draw broad conclusions about large populations, but others found plenty of alternative applications of statistics. Anyway, the age of data analysis was well and truly in swing at this point in the late nineteenth century. The United States was

getting into a bit of a pickle. And I know we're making jumps of centuries here, but we need to; we can't go through every single evolution of data collection and data analysis, as that would be a podcast series all in itself. So we're in the late eighteen hundreds and the US is in a bit of a problem. The country holds a census every ten years, where they're essentially gathering information about all the citizens in the United States.

This is required by the US Constitution, and there are several reasons why the Census Bureau holds a census every ten years. But one of those reasons is that the US House of Representatives' membership depends upon population. So the more populous a state is, the more representatives that state has in the House of Representatives. So if your state has a big population, there are more representatives that

go to the House. If you have a relatively small population, then you have fewer House representatives, right? That's how that works. So by eighteen eighty, things were getting into a really difficult situation. The process of collecting and then analyzing all the information was so cumbersome that it would take nearly the whole decade just to get to a result, and that means by the time you're drawing conclusions, it's actually

time for you to administer the next census. In fact, they projected that in eighteen ninety, working with the same process that they had depended upon previously, it would take a whole decade, so literally you'd be holding your next census while you were just getting your information from the last one. So the Census Bureau needed a way to collect and analyze this information in a much more efficient process. They tapped a man named Herman Hollerith to accomplish this.

So Hollerith took a punch card system that had been used in weaving, weaving with mechanical looms. I've talked about this in the past with the history of punch cards. In fact, this also gets into perhaps a somewhat apocryphal story of where the word sabotage comes from, but that's

for another time. So he took this punch card system that had been used to set weaving patterns with mechanical looms, and then he adapted that to serve as a way to record information, so that you could feed the card to a tabulation machine, which then could actually tabulate the results. And his invention meant that ten years of labor done by clerks working at desks would actually boil down to about three months of labor using the tabulation machine. Obviously,

that was a huge improvement. Hollerith formed a company that over time would evolve into one of the most famous companies in all the world, Kentucky Fried Chicken. I'm just kidding. It wasn't KFC. Instead, it was IBM. That's the company that would grow out of Hollerith's company that he founded in the nineteenth century. Anyway, we're not going to spend too much time in all these centuries gone by. We're actually going to speed things up and get up to

the twentieth century. But before we do that, let's take a quick break to thank our sponsor. We're back, okay, So the actual term big data is still waiting for us. We're not going to really get to that until we hit the late nineteen nineties or so. But there are a few things to point out before we get up

to there. Folks were starting to notice that we were generating, collecting, and storing an awful lot of information in the twentieth century, and that the rate of data generation was on the rise. Not only were we generating a whole bunch of information, we were doing it in larger amounts year over year. In fact, it was rising much faster than our rate of consumption of information, meaning that we were making way

more data than we were actually able to use. And a big thanks goes out to Forbes for an article that's titled A Very Short History of Big Data by Gil Press. A lot of the information that I'm drawing upon came from that article. It is fantastic if you want to learn more about this. I'm not going to cover every element that they do. I mean, that would just be me regurgitating their article. You should check it out if you're interested in the history of big data.

We're going to touch on a few of the important points, or what I think of as the important points. So one of the earliest ones we're going to talk about is in nineteen forty four, when a librarian named Fremont Rider, which is a fantastic name, wrote a work titled The Scholar and the Future of the Research Library. So Rider made an observation that reminds me a lot of Gordon Moore's famous Moore's Law, except this involves not silicon chips

but physical libraries. So Rider said that your typical library in your typical American university was doubling in size every sixteen years. He projected that this would mean that by the year twenty forty, the library at Yale University would be so large as to require a staff of more than six thousand people to manage it. Of course, this was before we had digital storage and digital filing systems

that have largely mitigated this particular requirement. We don't necessarily need the physical space that we would if everything were still in hard copy. But the observation showed that data accumulation really had a steep trajectory even back in the nineteen forties. Similarly, in the early nineteen sixties, a guy named Derek Price published a piece explaining that the number of scientific journals and papers was on a path of

exponential growth. It was doubling every fifteen years, so similar to the rate at which university libraries were doubling in size. Now, part of the reason for this, he said, was that scientific discoveries inevitably fuel further discoveries. So you find out something new, this inspires other scientists to look further into it, they find other new things, and so on.
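To put rough numbers on those doubling claims, here's a quick back-of-the-envelope sketch in Python. The dates and the sixteen-year period come from Rider's observation as described above; the rest is just the standard doubling formula.

```python
def growth_factor(years, doubling_period):
    # How much something grows if it doubles every `doubling_period` years.
    return 2 ** (years / doubling_period)

# Rider's claim: university libraries double every sixteen years.
# From his 1944 book to his 2040 projection is 96 years, or six doublings,
# which works out to a library 64 times its original size.
print(growth_factor(2040 - 1944, 16))  # 64.0
```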

In nineteen sixty five, the United States government needed to build a place that would store records, including things like tax returns and fingerprint sets, and so the plan was to take the paper records and then transfer them to magnetic tape, and then to store that magnetic tape in this so-called data center. This project fell through, however, because the public got nervous. They felt squicky about this idea of the government hoarding vast amounts of information about

its citizens. They did not fully trust the government. So you understand, like, they're thinking, I don't really feel comfortable with you just gathering all this information about us. It feels kind of oppressive. Now, what's funny to me is that today the average person is more than willing to let companies do this to them without even protesting it.

Because that's how all the online social network companies work, right? They work on the basis of gathering information about us and then peddling that, or hoarding it, however you might think of it. And it's very similar to what was happening in the nineteen sixties. And back then we were like, no, that's not cool, and now we're like, that's just how it works. It's wild to me. Anyway, I'm going to skip ahead a little bit to the nineteen eighties. There was a lecturer, I. A. Tjomsland, and

I know I butchered his name. I apologize anyway. He gave a lecture at the IEEE Symposium in which he posited that one reason all this information is piling up is that we don't really have a good way to determine which information is relevant and which information is not. We can make that determination, but it requires work, and meanwhile, we're still accumulating more information. So it's the kind of work where you're never done, and it feels like you're never making any progress. So most

of us never bother to do it at all. And if our ability to store data is sufficient, in other words, if we have ways of storing the information, then we have even less incentive to make any determination about the data. Right? Like, if we've got plenty of storage, well, let's just go ahead and keep the information. There's no reason to have

to worry about whether it's useful or not. We should keep it, because it's better for us to keep useless information without needing it, rather than accidentally deleting something that turned out to be important. Right? And this kind of makes sense. I mean, I'm sure a lot of

you out there can apply that to your lives. I certainly can apply it to my life, right? Like, I have file folders that are full of stuff that I'm never going to touch again, but I still feel reluctant to delete it just in case I do need to touch it again sometime in the future, even though the likelihood of that is very low. So that's anecdotal. I can't really call that evidence to prove the point, but

it feels like the point is relevant. So this is also how I play a lot of those big open world computer RPGs, by the way, things like Skyrim or whatever, because I'll just hoard potions and scrolls and never use them, because what if I need them more in the future? Baldur's Gate three has really done a number on me with this. I got a real problem with

that anyway. The Forbes article details several more entries indicating how very smart people were taking note regarding the accumulation of information, as well as methods to store the information, and increasingly, as time went on, how we can do useful things with all this information. So I recommend you check out that Forbes article if you want to learn more.

I think it goes up to about twenty twelve at this point; it has been updated numerous times, but obviously twenty twelve was quite a long time ago, so it's not exactly up to present day. But it's still a really interesting article that gives lots more details about this. But I don't want to just regurgitate the article, so we're going to hop on ahead. Now, folks in general were becoming more aware of this information challenge that was growing.

But where did the term big data actually come from? Well, chances are it sort of rose organically in conversations within the computer sector, as, you know, hackers and computer scientists and programmers and researchers were all wrestling with ways to deal with data. Now, by this time, folks had adapted an observation made by Cyril Northcote Parkinson to apply to

computer systems and to information. So Parkinson's original observation was that generally speaking, in public administration offices, you know, like government offices, work expands to fill the time that was allowed for that work. So if you have a project that's going to be due in three weeks, but really, if you were to be brutally honest, there's only a week's worth of work to do for that project. Well, that work will almost magically expand so that it actually

takes three weeks to complete. This gets more nuanced and it brings into account elements like bureaucracy, but you get the point, right? Somehow it doesn't matter who is working the job, it doesn't matter the nature of the work; the work will expand to fill the amount of time allotted to do that work. Which meant that if you had said it would take two weeks, it would have just expanded to two weeks, not three.

It's very weird, right? Anyway, folks in the computer biz adapted this to say that data will expand to fill whatever space you have available for that data. So, in other words, you make a bigger storage unit, you're going to fill it; that data will just expand to fill it, even though you thought, oh, I'm future-proofing this. And again, anecdotally, I have observed this in my personal life.

I remember when hard disk drives first became a thing in personal computers. Like, they already existed, but personal computers didn't have them when they first came out, right? You were using external drives like floppy disks and stuff. And I remember whenever there would be a dramatic expansion of storage space, and it always seemed to be dramatic, right, it always seemed like it had doubled since last time.

And typically that's how it worked. Anyway, I would walk away thinking, Wow, I'm never gonna fill all this space. I mean, who even needs that much space? Two hundred and fifty six megabytes? Who the heck needs that much space? That's way too much. I mean, I'll never fill it up. But of course I would prove myself wrong, typically in record time. But beyond anecdotes, which again don't really count as evidence, the observation really pointed out that we will

eagerly fill up whatever space we're given. You could argue this goes back to our tendency to avoid deleting material out of concern that it might one day become useful. Anyway, by the mid nineteen nineties, there was a computer scientist named John Mashey, and he was giving presentations that related to this concept of big data. Now, Mashey has dismissed

the idea that he personally coined the phrase. At most, he says that he popularized the term big data in his talks. But his point was that he used the phrase big data because it was a shorthand way to give a nod to several related challenges, ranging from storage to analysis. So one could argue that Mashey's use of the term approached what we mean by big data today, but it wasn't one hundred percent the same thing. And the earliest use of his I've seen cited happened sometime around nineteen

ninety eight. So we know Mashey didn't invent the phrase, and we know that partly because researchers found an instance that predates his talks by nearly a decade. Steve Lohr wrote a piece for The New York Times titled The Origins of Big Data: An Etymological Detective Story. A great,

great article, by the way. Lohr spoke with an associate librarian at Yale Law School named Fred Shapiro, and Fred Shapiro did some research and uncovered an instance of the phrase big data in a nineteen eighty nine article in

Harper's magazine. The author of that piece was Erik Larson, who said, quote, the keepers of big data say they do it for the consumer's benefit, but data have a way of being used for purposes other than originally intended, end quote. And boy howdy, we have seen that observation play out again and again, haven't we? It's remarkable, because nineteen eighty nine predates the World Wide Web, certainly predates all the

social networks that we talk about. But Erik Larson's observation is just as relevant, if not more relevant, today than it was in nineteen eighty nine. Also, incidentally, Erik Larson wrote one of my favorite books of all time. It's

titled The Devil in the White City. Famous book. I'm sure a lot of you have already read it, but for those who haven't, it's a book that tells two somewhat intertwined stories: the eighteen ninety three World's Columbian Exposition in Chicago and the tale behind H. H. Holmes, credited as

one of America's first serial killers. Now, I originally bought the book because I was interested in Holmes's story, but I've got to be honest, I actually found the chapters about the exposition to be far more captivating, and it ties into a lot of the stuff we talk about on TechStuff. So it's a great book if you're looking for something to read. But now let's get back to big data. So things continue on their inevitable path through time. As it goes, time marches on, we

get up to the two thousands. By now the Internet has greatly exacerbated our data creation and accumulation problem. In two thousand, Francis Diebold wrote, quote big data refers to the explosion in the quantity and sometimes quality of available and potentially relevant data, largely the result of recent and unprecedented advancements in data recording and storage technology end quote. So we're really starting to close in at this point on the concept of big data as we understand it today.

Then we get up to two thousand and five, and a couple, actually several, important things happened that year in the realm of big data. We get Tim O'Reilly and his media company, fittingly enough called O'Reilly Media, and this is the year that he would publish an article titled What Is Web two point oh, a famous or perhaps

infamous article in tech circles. So the dot com bubble had burst several years earlier, around two thousand, two thousand and one, and O'Reilly was making observations about the qualities that helped the companies that survived that crash versus the companies that went under. Like, what set them apart? What are some of the qualities that we can say are really valuable on the web? And part of that

involved how successful web ventures were handling data. Now, that same year, he had a guy named Roger Mougalas, actually I don't know how to say his last name, but he was also with O'Reilly, and he argued that big data refers to how we now had the capacity and the capability to gather and store data sets that are so large that our traditional business tools are incapable

of doing anything useful with that information. It makes me think of the Joker in the Dark Knight film, where he says he's like a dog chasing a car; he wouldn't know what to do if he caught it. That kind of thing. Yeah, we've got all this information, but the tools we have aren't sufficient to do anything meaningful with it. We were overwhelmed with information. But that same year, because an awful lot happened in two thousand and five in the big data space, Doug Cutting and Mike Cafarella released

a tool that would really change things. I'll explain more, but first we're going to take another quick break to thank our sponsors. Okay, before the break, I teased that we were going to talk about a tool made by Doug Cutting and Mike Cafarella that would actually change our approach to big data and make it possible to do meaningful things with it. So these two had read papers about Google's file system as well as a tool that

Google was using called MapReduce. Now, the purpose of MapReduce is to take large clusters of data and essentially break them down into more manageable chunks, and then analyze those chunks in parallel, and this makes the process of data analysis faster. It's really just another form of

parallel processing when you really think about it. Anyway, Cutting and Cafarella were inspired to make their own tool that could do similar work, but, you know, they could make it for everybody. And so they created a project called Hadoop, and the first version of Hadoop would come out in two thousand and six. It's an open source project, and it's still around today with thousands of contributors.
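To make the map/reduce idea concrete, here's a toy, single-machine sketch in Python. This is not Hadoop code, just the classic word-count example commonly used to illustrate the model; in a real cluster, each chunk would be mapped on a different machine in parallel.

```python
from collections import defaultdict

def map_phase(chunk):
    # Map step: emit a (word, 1) pair for every word in one chunk of input.
    for word in chunk.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    # Reduce step: sum the counts for each distinct word.
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

# Each string stands in for a chunk that a real cluster would hand
# to a separate machine and process in parallel.
chunks = ["big data is big", "data about data is metadata"]
all_pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
print(reduce_phase(all_pairs))
# {'big': 2, 'data': 3, 'is': 2, 'about': 1, 'metadata': 1}
```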

But the important bit is that we were now starting to develop new business tools that actually could handle the massive amounts of information that we were accumulating. But let's take a quick step back. Let's also consider what's going on around this same time, the mid to late two thousands, and by that I mean the first decade of the

two thousands. So for the first several years in the computer age, it was really computer systems themselves that were seen as the genesis of data creation, right? Like, the computers are the things making all this info. But other elements were starting to come into play by this point. So when we get up to two thousand and seven, we're into the consumer smartphone era, because that was the introduction of the Apple iPhone. These consumer smartphones can generate

enormous amounts of information. You can perform all sorts of computational tasks on them. They can track your location, you can connect to the internet, et cetera. We also were getting into the age of the Internet of Things, so we were starting to create millions of these tiny devices, usually designed to collect specific bits of information and then zip that info off to somewhere else. So it might be a speed sensor along a road. It might be a

thermometer at a weather data collection site. It might be a thermostat in your own home. It could be anything. Could be a smart speaker. All of these individual little components would add to the amount of information we were gathering and storing and creating, all in the hopes of

being able to do something useful with that info. And we also had another buzz term that was starting to gain traction, just as big data was really beginning to transition from a topic that was talked about in a relatively small subculture of computer scientists and such into a topic that the general public had actually heard about. You know, usually we're a few years behind whatever group is really

focused on the subject matter. So this other buzz term was cloud computing, which I also got an assignment to work on right around the same time as big data. Now, the simplest way to describe cloud computing is that it's when you use someone else's computer to do your computational tasks, because you log in through your computer, but it's this other computer that's actually doing the work, or it might be a network of other computers doing that work.

That work could be that you're storing photos of kitty cats on a drive on some cloud storage, or it might be that you're using cloud computing to help you crunch really big numbers that your computer could not handle, and you're peeling back the mysteries of quantum mechanics or something. So cloud computing would rise at the same time as big data, and cloud computing and big data are very closely related. They're enablers of one another, in a way.

Organizations and companies feel the need to engage with cloud computing services because their data tasks are growing increasingly complex and voluminous, and it gets harder and harder to handle all of that on your own. Right, Like most businesses these days are not using exclusively on premises computing systems to do all their computation and all their storage. It just is not practical, right. You would have to continuously buy or lease more space just to hold all the

systems you would need. So instead they engage with cloud computing companies that will provide those services for them, and then the cloud computing companies will go out and build a warehouse and fill it full of computers. Big data leans on cloud computing to make it practical to even accumulate all that data in the first place, let alone analyze it. Now, the lure of big data, the reason why we're concerned with it, I mentioned this in the

very beginning of this episode. The lure is that there are nuggets of truth hiding inside vast amounts of possibly useless information. There is signal, but there's also an enormous amount of noise. If we can identify those little nuggets of truth, then we can potentially benefit from them. But these huge piles of information are just so vast that our ability to zero in on the important stuff is just not up to snuff. It is the proverbial needle

in a haystack problem. So the promise of big data in our current age is that when we use the right tools, we can sift through the haystack and we can find all the needles, which is a really tempting concept, because who knows what you might find when you analyze large amounts of information. Maybe you identify patterns that you can then use to lead you to change things so that you can save huge sums of money in the way you do business. Or maybe you identify

a previously unknown opportunity. Or maybe you can spot connections between data points that you didn't see before and you start to see correlation. Maybe you even determine causation. Maybe this leads you to make some incredible scientific progress, and it might be on anything from medicine to astronomy. It

all depends on the type of data, obviously. However, there's a big caveat that goes along with this sort of beautiful concept, and it's possible that the tools we use will make mistakes, that they're going to spot patterns or meaning when in reality there isn't anything there. They mistake

something to be meaningful when in fact it's not. This is kind of like when you look up at the clouds and you see a pattern that makes you think of a specific shape, very like a whale, as Hamlet and Polonius would say. So the shape of the cloud might remind you of a whale or a dog, or a hand or whatever, but you probably are aware that the cloud isn't actually a whale or whatever. In fact, you might even realize that your point of view is part of what is shaping your perception. It's part of

the reason why it looks like a whale. Maybe if you were a mile away to the east or something and you were to look at that same cloud, the angle you would be at would mean that the cloud wouldn't look anything like a whale. Maybe it would look like something entirely different, or maybe it wouldn't remind you of anything at all. So, from one perspective, the cloud shape appears to have some meaning. From other perspectives, it doesn't.

So it would be a mistake to draw any conclusions based on that one perception, because it would just be the illusion of meaning, not actual meaning. And that can happen when you're looking at huge data sets too. You might see something that looks like it's meaningful, that it represents a pattern or a connection, when in fact it doesn't.

That can lead you on a wild goose chase, and in a worst case scenario, you might dedicate a lot of time and effort and money toward pursuing this perceived meaning, only to find out much later that there was nothing there at all. Now, that's not to say that we can't trust the outcomes of big data analysis, but it does mean that we have to make sure that we have tests to ensure the validity of those analyses.
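Here's a small, self-contained Python sketch of that trap, with made-up data: generate nothing but random noise, search enough columns of it, and one of them will correlate with your target anyway, purely by chance. The numbers are arbitrary; the point is that a big enough haystack always contains something that looks like a needle.

```python
import random

random.seed(42)

def correlation(xs, ys):
    # Plain Pearson correlation coefficient, computed by hand.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# A thousand columns of pure noise, twenty rows each: no real signal anywhere.
columns = [[random.gauss(0, 1) for _ in range(20)] for _ in range(1000)]
target = [random.gauss(0, 1) for _ in range(20)]

# Dredge the data: pick the column that best "predicts" the target.
best = max(columns, key=lambda col: abs(correlation(col, target)))
print(round(abs(correlation(best, target)), 2))
# Typically prints something like 0.7, despite there being no pattern at all.
```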

We need to take a scientific approach toward big data, or else we run the risk of chasing a dream rather than learning more about reality. And anytime there's any uncertainty, there will be people who move in to exploit that uncertainty: hucksters, scam artists, snake oil salesmen. So as an example of this, I would point to the

explosion we are seeing in artificial intelligence right now. AI has tons of applications, including in the analysis of big data, and that means that there is also opportunity there to take advantage of people. So it doesn't take much imagination to think of a company that actually uses cheap human labor, passes itself off as a truly AI company, and markets that company's services to big

businesses that may not know any better. And really you're just exploiting people in poorer countries and passing it off as being this really high tech business. As it stands, even if you're not doing that, human labor is already the backbone of the AI industry, like it or not. People in countries that have low wages and very little protection in place for working citizens are spending countless hours tagging data so that AI can actually make use

of it. So as we marvel at how clever AI tools appear to be, there are folks out there on the margins who are the ones labeling images and applying metadata to text so that the AI can grab the right stuff based upon a query. Anyway, I think it's important to remember that big data can, with the right tools, provide us insights that we might not otherwise make because the amount of information is just too large for us

to handle. Those insights might mean we can do things like streamline supply chains, or identify a market for a specific product, or find a new way to treat an illness. Big data can also lead us to some darker outcomes. Companies will scrape as much of your personal information as they

possibly can. They will sell it to other companies. These other companies will market you to yet more companies in an effort to serve you ads or to lure you into doing something foolish, like downloading malware or consuming misinformation. Because behind every silver lining is a big, scary cloud. Maybe it's in the shape of a whale. That is a brief history of big data. It is a history that is ongoing. I'm sure we're going to see some incredible,

incredible discoveries thanks to analysis of big data. I'm sure we're also going to see some pretty scary stuff as a result of it as well, such is life, but it is fascinating to see how we have arrived at this point, like first from the point of how do we collect all this information? And then what do we do with it? I hope you enjoyed this episode. I hope you are all well, and I will talk to you again really soon. Tech Stuff is an iHeartRadio production.

For more podcasts from iHeartRadio, visit the iHeartRadio app, Apple Podcasts, or wherever you listen to your favorite shows.
