AI Model Collapse and the Dangers of AI-Generated Content

Jul 15, 2024 · 45 min

Episode description

An AI image of a devious banker with way too many fingers can be entertaining, but could it also be a warning sign for the future of the Internet? We learn about some research that indicates future generative AI may be a real mess if it trains on other AI-generated content.

See omnystudio.com/listener for privacy information.

Transcript

Speaker 1

Welcome to TechStuff, a production from iHeartRadio. Hey there, and welcome to TechStuff. I'm your host, Jonathan Strickland. I'm an executive producer with iHeart Podcasts. And how the tech are you? So imagine for a moment that you

are in school. Some of y'all might actually be in school, but others, like me, we have to satisfy ourselves by having that occasional stress dream where we imagine that we're in school and it's time to take a final and we haven't gone to class all year, and also we can't remember our locker combination. I don't know about you, but I still occasionally get those dreams. And I'm almost

fifty years old at this point. Anyway, you're in school, you're in English class, and you've been given the dreaded term paper assignment. You're told you need to go to the library and you have to gather resources and read up and form your thesis and write your paper while making verifiable citations all the way through. So off you go to the library. However, you discover, horror of horrors, that all the resource books have disappeared. There are none. In

their place are other student term papers. Now, some of those term papers are pretty good, some of them are terrible. Nearly all of them do have a list of references at the end, but the problem is that you don't have access to those references. You only have access to the term papers, which, in a way, you could say is a filtered view of those references. But you have no way of knowing if the student who wrote the term papers you've pulled out did a proper citation. You

don't know if the student understood the source material. You don't know if they have made a valid reference using that source. You don't know if the student didn't understand the source and thus misconstrued the information, either accidentally or on purpose, or if the student is just outright plagiarizing the source material or making stuff up. So how do you think your own term paper would turn out? Probably it'd be a challenge to write a good term paper.

It definitely would be difficult or almost impossible to support your thesis using citations, because all you would have access to would be other term papers. Chances are you'd have a pretty lousy grade by the end of that assignment. Now, I started off this episode with that analogy because today we're going to talk about what happens when AI models train off stuff that was generated by other or sometimes

even the same but earlier versions of AI models. So when bots make stuff that other bots consume, and then those other bots make new stuff and the cycle goes on, where are the humans in this picture? Maybe they're in an actual library, because the online resources will all have become practically useless. So if we want to actually learn anything,

we're gonna need to go back to the basics. So we're going to talk about an idea called model collapse, as in large language models, LLMs, and other types of AI models. We're going to build to that. However, first up, let's explore the tendency of AI models to produce wrong or misleading results, regardless of whether the material used to

train that AI model came from AI or humans. This is something I've talked about in past episodes, but it's an important part to kind of build toward our understanding of what model collapse is. Now. In past episodes, I've talked about the issue of AI hallucinations, also sometimes called confabulations.

Some people prefer confabulations to hallucinations. This is the tendency for generative AI to mistakenly include untrue or misleading information, or to insert stuff that does not belong into whatever it is it's creating, whether that's an image or text or what have you. So one fairly recent example of this was when Google's AI-augmented search tool suggested that you add a non-toxic glue to your pizza ingredients if you want to solve the irritating issue of cheese slip-

sliding away off your ding dang dern pizza. Clearly this answer is not acceptable. Adding glue, non-toxic or otherwise, is not a way of making good eats. I'm pretty sure Alton Brown would agree with me, and actually I would argue this is one of the less egregious cases of AI providing a bad answer. It's famous because it got a lot of traction. It went viral for how bad the answer was. But in the grand scheme of things, there are other examples that were far more potentially harmful.

So why does AI do this sometimes? Well, there are a few different contributing factors that lead AI to making these mistakes. By the way, the reason why some people prefer confabulations as opposed to hallucinations: hallucination sounds like the AI has somehow been tricked into thinking something is what it isn't, right, like the idea that when you hallucinate, you're seeing or hearing or experiencing something that's not really there.

Confabulation suggests that the AI is inventing something. It is confabulating, it is creating an answer where there was none, and so some people prefer the second one because it puts more of the onus on the AI model itself. So one of the factors that contributes to AI making mistakes: you know, large language models and the like are in

part focused on pattern recognition, and this can lead to issues. Now, recognizing patterns is what gives these models the ability to form relevant and coherent responses to queries, and obviously pattern recognition is important, otherwise you're just gonna perceive everything as being random and meaningless, and then really, this whole conversation doesn't mean anything either, or if the whole universe is meaningless,

then what are we even doing here? But I don't want you to go down that path of existential dread. So sometimes AI will detect a pattern where there really isn't a pattern. And we humans do this too, you know, we sometimes experience pareidolia, for example. That's when we perceive something meaningful within an otherwise meaningless thing, like we see a pattern where there is none. So if you were to look at the clouds and you say that one of them looks very like a whale, that's pareidolia.

It's also a reference to Hamlet. The infamous face on Mars, which was really just a hill with some shadows cast on it because of the angle of the image, that was another example of pareidolia. People began to think that there was actually a big sculpted face on Mars. There's not. It's a hill. The shadows hit the hill in a specific way that made it look kind of like the face of an enormous statue, something like the Sphinx, something along those lines. But in fact it was just a hill.

And if you took another image from a different angle, which people have done, the illusion of a face disappears. So again, that was us inventing a pattern where there was none. Now, much of the time we humans can recognize when the things we see, you know, the shapes of faces or whatever it may be, aren't actually there. Right, we can recognize, oh, that looks like a blah, blah blah, but we know it's not actually a real image of that.

It just happens. Now, sometimes we don't recognize this. Sometimes there are times when people will assume that what they're seeing is an actual image made with intent and intelligence, perhaps not by humans but by something. So there are all those stories of people going bonkers because they believe they saw an image of, like, the Virgin Mary in

a potato chip or whatever. And machines don't necessarily have any checks against false hits when it comes to pattern recognition, and then they might act on a perceived pattern, which means the machines produce bad results. What's more, machines can perceive patterns where we can't. Like sometimes there are patterns present that we cannot perceive because maybe the dataset is far too large or far too complicated, and so we can't perceive where the pattern is. It's just beyond our abilities

to do so. But sometimes machines can detect those patterns, and sometimes they are meaningful. So it can be really tricky. If a machine thinks it's found a pattern, it can be hard for people to verify or discredit that because it's on a scale that we humans are not really well equipped to handle. With generative AI, this can mean that the AI model correctly identifies that it needs to use a specific syntax to craft a response to whatever query or direction it was given, and it can thus

put together a sentence that grammatically makes sense. What's happening is it's essentially statistically analyzing the structure of hundreds of millions of sentences, as well as the role that certain words play within those sentences, so that it quote unquote knows how to write a grammatically correct response, and ultimately it's using statistics to pick what should be the most

correct word in each position of that sentence. So ideally, it's pulling information from various sources that are related to whatever it is you're asking about and pulling the words together in a way that makes logical sense and is accurate, and it's a correct answer to whatever your question is. But that doesn't always happen right. Sometimes it can't find the right word. Sometimes it finds a different word that

it thinks is right, but it's not. And the real problem is it will present this to you authoritatively as if the AI is absolutely certain this is the right answer, when in fact it's wrong and the AI has no way of knowing it's wrong. It's not purposefully trying to mislead you, at least not necessarily. Maybe it was given direction to try and do that, but that's another matter. It's just trying to complete its task and failing to do so accurately. Sometimes the word or a series of

words can be wrong. So grammatically it could be correct, but factually it could be completely made up. And why does this all happen? It does get really complicated. It's not necessarily due to just one specific flaw. It's not always the case that, oh, that data point didn't appear in the data set for some reason, and so the computer made something up. There are other issues that could also be at play. So, for example, one possible reason for

hallucinations is something that's called overfitting. IBM defines this as what happens quote when an algorithm fits too closely or even exactly to its training data, resulting in a model that can't make accurate predictions or conclusions from any data other than the training data. End quote. That's from a piece on IBM dot com. It's titled "What is overfitting?" Sometimes models get so complex or they're trained so closely on a specific data set that they start to pick

up more noise than signal. They give significance to insignificant things. I think of this kind of like the character Drax in the Guardians of the Galaxy movies. Drax takes things literally, so if you use a saying or an idiom on him, he's likely to interpret what you're saying as being what

you mean. So if you say, oh, that's like throwing the baby out with the bathwater, he would assume you're talking about something you have literally done before in your life, that you have literally thrown out a baby with bathwater, and he would not understand you were using an analogy to describe getting rid of important stuff along with the

unimportant stuff you want to get rid of. If a model has been overfitted, if it's been trained too much on a relatively narrow set of data, it might have trouble taking what it has learned and generalizing those learnings

towards something else that's outside the data set. And rather than saying I'm sorry, I don't know the answer to that, it could produce an answer that follows the statistical rules that the model is set to. In other words, it'll create something that grammatically makes sense, but it won't necessarily be relevant or, you know, thematically make sense.
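(To make that "statistically pick the next word" idea from a moment ago concrete, here is a minimal sketch in Python. It is not anything from the episode, just an illustration with a made-up toy corpus: the output reads fluently because it follows observed word patterns, but nothing in it checks whether the result is true or even meaningful.)

from collections import Counter, defaultdict

# Toy corpus standing in for the model's training text (hypothetical example data).
corpus = (
    "the cheese slides off the pizza . "
    "the glue holds the paper . "
    "the cheese melts on the pizza ."
).split()

# Count how often each word follows each other word (a tiny bigram table).
followers = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    followers[current_word][next_word] += 1

def generate(start, length=8):
    # Greedily pick the statistically most likely next word at each step.
    words = [start]
    for _ in range(length):
        options = followers.get(words[-1])
        if not options:
            break
        words.append(options.most_common(1)[0][0])
    return " ".join(words)

# Grammatical-looking output, with no notion of whether the claim is correct.
print(generate("the"))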

So in this way, an AI model can become like that stereotypical person in the car who absolutely refuses to pull over and ask for directions when they get lost, because that would be showing weakness. No, gosh darn it, we will somehow reason our way out of taking that wrong turn forty-five minutes ago. That'll fix everything. Except it doesn't fix everything, and it can make things worse. But it's not just pattern recognition that can trip up AI models.

Another issue is bias. I've talked about bias in other episodes, but it's really important that we understand what we mean when we're talking about bias and how it can happen, because I think a lot of people get tripped up. They think it's a machine, right, it doesn't possess opinions. How can it have bias? Well, we'll explore that in just a couple of moments, but first let's take a quick break to thank our sponsors. How can an AI model

have bias? Well, the answer is that the machines that AI runs on, the algorithms that AI is built upon, all this stuff, it didn't just pop out of nowhere. Ultimately, this stuff was designed, built, and programmed by human beings. Even if you have a piece of software that was designed by AI, well, the AI that designed it in turn had been designed by humans, at least somewhere down the line once you trace it back far enough.

Human beings absolutely do have biases, and those biases can make their way into the routines and processes of machines. MIT has a great introduction to AI hallucinations and bias on a web page that has the fitting title When AI Gets It Wrong: Addressing AI Hallucinations and Bias. Now, in that article, the author points out that AI has had issues with bias for years and uses the example of image analysis. The author cites a project called Gender Shades.

This was led by Joy Adowaa Buolamwini, and I apologize for my pronunciation of the name. But the project examined how an AI-powered gender classification tool performed when presented with subjects of varying genders, ethnicities, and skin tones from the IARPA Janus Benchmark A data set, or IJB-A. This is a database of facial images taken from various angles and lighting conditions of lots of different people. It's used

as a government benchmark for testing stuff like facial recognition technologies. Now, the project also used a gender classification benchmark from Adience, and this was in part to try and address shortcomings with the IJB-A benchmark set. Plus, due to the limitations of both of these data sets, which I'll talk about in just a moment, the project also outlines a process to create a better data set for the purposes of training technologies like facial recognition and gender classification.

The project aimed to test several gender classifier programs from companies Microsoft and IBM, among others, all with regard to quote gender, skin type, and the intersection of skin type and gender end quote. So Joy found that the data set from IJB-A skewed heavily male and toward lighter skin

tones. In fact, she said between seventy nine point six percent and eighty six point two four percent of all the images in the database were of people with lighter skin tones, and fewer than twenty five percent of all the images were of women or female-presenting people. Worse yet, only four point four percent of all the images were of female-presenting people who had darker skin. Adience's data set had

a better distribution of photos, at least between genders. Female-presenting people made up fifty two percent of the images in Adience's data set, but again, lighter skin tones made up the majority of these images. Less than fifteen percent of all the images in that data set contained people of darker skin tones. So I'm sure you can already

see where this is going. If you train an AI model on data that has a disproportionate emphasis on certain factors, such as certain genders or certain skin tones, then you would expect the AI to be better at handling cases that fall into those categories, right? Like, if most of the data you've fed to your AI model is of men who have a lighter skin tone, then when you are serving the AI model a picture of someone who's male presenting and has a lighter skin tone, chances are

the tool is going to work better. If you are instead feeding it images of people who fall outside those majority cases, the AI tool is probably not going to work as well with them, and that's exactly what Joy found in her research. She discovered that gender classification tools from all of the providers performed better with lighter-skinned men than with any other group. They performed the worst with darker-skinned women. Thus we have a bias in the system.
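(As an aside, the kind of skew Joy measured can be checked before training with a simple audit of a dataset's labels. Here is a minimal sketch in Python, with hypothetical label fields and made-up counts, not the Gender Shades methodology itself:)

from collections import Counter

# Hypothetical per-image labels; a real audit would read these from a dataset manifest.
dataset = (
    [{"gender": "male", "skin_tone": "lighter"}] * 6
    + [{"gender": "female", "skin_tone": "lighter"}] * 2
    + [{"gender": "male", "skin_tone": "darker"}] * 1
    + [{"gender": "female", "skin_tone": "darker"}] * 1
)

# Tally how many images fall into each gender / skin-tone combination.
counts = Counter((img["gender"], img["skin_tone"]) for img in dataset)
total = sum(counts.values())
expected_share = 1 / len(counts)  # what an even split across subgroups would look like

# Report each subgroup's share and flag anything far below an even split.
for group, n in counts.most_common():
    share = n / total
    flag = "  <-- underrepresented" if share < expected_share / 2 else ""
    print(f"{group}: {n} images ({share:.1%}){flag}")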

The data that folks use to train these systems had that bias, and it unsurprisingly affects how the AI does its job. Now, this isn't just a curiosity for research labs. Of course, around the world, various organizations and companies are making use of facial recognition tools and gender classification tools. There are numerous stories of law enforcement agencies getting into

hot water for relying on this kind of technology. So we know that this technology isn't reliable, particularly if someone belongs to a group that's outside of lighter skinned men, and the data being used to train these tools is limited. That's why we're having these issues, or one of the main reasons why we're having these issues. So it stands to reason we should not employ those tools for anything really at all, other than maybe working to make them better.

But we definitely shouldn't be using them for things like law enforcement, for example. At least we should not use them until we can address the problem of bias. Generative AI can actually have similar issues with bias. That MIT article that I mentioned earlier in this episode cites another article by Leonardo Nicoletti and Dina Bass titled Humans Are Biased. Generative AI Is Even Worse. This piece appeared in Bloomberg.

So this article explores how a generative AI platform called Stable Diffusion had a tendency to make assumptions based on racial and gender stereotypes, thus repeating and even amplifying those stereotypes. Nicoletti and Bass performed an informal test with Stable Diffusion, a pretty thorough one, but still informal. They asked Stable Diffusion to generate images of people who were working one

of fourteen different jobs. Now, half of those jobs belonged to what they called high-paying positions, like things that you would typically associate as a high-paying job. The other half were typically low-paying jobs, well, actually a little less than half of them were low-paying jobs. Three of them actually fell into the category of crime, so, like, you know, thief or something like that. The two had Stable Diffusion generate more than five thousand images

total so that they could really compare. They didn't want to just create, you know, a single image each, that's a terrible test. They wanted to see, all right, is this something that's actually appearing over and over again when we make use of this tool, or is it possible that, you know, you run fourteen tests and it just happens to go along with racial stereotypes? Nope. They classified the generated images based off of the Fitzpatrick skin scale.

This is actually a skin pigmentation metric that's used by dermatologists as well as like other researchers, and the scale goes from one to six, so one would be very light skinned and six would be very dark skinned. The researchers found that stable diffusion was far more likely to create a person with a lighter skin tone for positions that traditionally fall into the higher paid categories, and that it was more likely to generate someone with a darker

skin tone for lower paid or criminal categories. What's more, stable diffusion generated images of people appearing to be men or male presenting for most of those higher paid positions. It was very rare for it to generate the image of a female presenting person in the role of one of these traditionally higher paid jobs. So the AI was perpetuating and amplifying these racial and gender stereotypes. This actually reminds me of a classic riddle that was intended to

reveal bias. I'm sure most of you have heard this before or some variation. So the riddle typically goes something like this. A father and a son are in a terrible car accident, and the father tragically dies at the scene. The son is badly injured. EMTs arrive. They rush the boy to a surgical ward. The surgeon on duty looks at the boy and says, I can't operate on him, he's my son. Well, how could that be true? Now, the obvious answer is the surgeon is the boy's mother.

And I think a lot of people arrive at that conclusion much more easily today than they did when I was a kid. Like when I was a kid, the sexist stereotype was that all quote unquote real doctors and surgeons were men, and women, they were nurses or administrators, right? That was the stereotype that people kind of believed in. But I'm sure most of y'all understood this answer, or

you've been exposed to this riddle numerous times. I mean, it is a meme at this point, but again, back in my day, a lot of folks would likely get stumped by this, or they would say something dumb like, oh, it turns out the surgeon was the real dad and the father who died at the scene had been the adoptive father, he had adopted the boy, or something along those lines, which reveals the bias of the listener. It reminds the listener to think critically and be aware of sexist stereotypes.

So AI can produce the wrong results due to bias built into the underlying model and end up making these same mistakes, right? Like if you say surgeon, it may mistakenly just believe, ah, you meant man. It has to be a man that I generate in this image because the user said surgeon, so that means man. That's a real problem. With enough work and attention, we can actually create training materials that minimize bias and can help reverse this trend. But even doing that is not enough to

eliminate errors in generative AI. There are other problems we have to look out for. So what happens when you have an AI model, like a large language model, for example, and part of the massive amount of material that it's training itself on includes data sets that were generated by

other AI. When an AI image generator is pulling images that were made by other image generators and then training itself on that, or, you know, even if it's pulling images that an earlier version of that very same generator had created, the mistakes that exist in those AI-generated images, or, if we're not talking images, then in text or whatever, those things, you would argue, oh, those things are noise, right, those

are mistakes. But AI doesn't know that they're mistakes. It doesn't know that it's noise. If you're training it on the data, it thinks it's significant. And if it thinks it's significant, it's going to incorporate it and perhaps even dial it up quite a bit. So a great way of illustrating this, in my opinion, is to talk about fingers. I mean, I'm sure all of you out there have experienced seeing AI-generated images that hilariously get

the fingers totally wrong. A lot of AI image generators have real problems with fingers. So you might have folks in images who wind up with way too many fingers, like seven or eight per hand, or maybe they have not enough fingers, or maybe all their fingers are thumbs, or maybe they bend in unnatural ways, or they all look like long strands of spaghetti. These are clearly mistakes. You know, image generators have identified that fingers are appendages, and these appendages

attach to hands. But the machines don't really follow the rules when it comes to portraying those fingers, and they do the best they can, and sometimes the best they can is hilariously bad. But if image generator models train on material that was created by AI, those weird fingers are seen as a feature, not a bug. Like the AI model doesn't know, oh, fingers don't actually look like that,

that's wrong. It just says, ah, this is how fingers sometimes look based upon these images I've been trained on, which means the next generation of image generators will stress these features more instead of correcting for them, which means you're going to get some really weird images as a result. And this process can repeat itself, and it gets worse and worse each time. It's like making a copy of

a copy of a copy. You eventually reach a point where the copy you have produced is illegible or doesn't look enough like the original at all for you to even easily say, oh, this is a copy of that. That can be a real problem. And of course this is just one example, the fingers in AI images. That's an easy mark to hit, right, but there are countless other examples.
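(That copy-of-a-copy effect is easy to reproduce literally. A minimal sketch using the Pillow imaging library, assuming a hypothetical local file named photo.jpg, re-encodes an image over and over, the way each model generation only ever sees the previous generation's output rather than the original:)

import io
from PIL import Image  # Pillow; "photo.jpg" is a hypothetical local file

img = Image.open("photo.jpg").convert("RGB")

# Re-encode the image repeatedly; each pass only ever sees the previous copy.
for generation in range(50):
    buffer = io.BytesIO()
    img.save(buffer, format="JPEG", quality=30)  # lossy save, like learning from generated output
    buffer.seek(0)
    img = Image.open(buffer).convert("RGB")      # the next "generation" starts from this copy

img.save("photo_generation_50.jpg")
# Compression artifacts pile up across the early passes, and the detail that
# gets lost along the way never comes back from the copies alone.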

In a paper titled The Curse of Recursion: Training on Generated Data Makes Models Forget, a group of researchers from the University of Cambridge, Oxford University, Imperial College London, the University of Edinburgh, and the University of Toronto present an argument for a pretty bleak future if AI researchers don't take the proper measures to head it off. We're going to talk more about that in just a moment, but

first let's take another quick break to thank our sponsors. Okay, before the break, I mentioned this paper, The Curse of Recursion: Training on Generated Data Makes Models Forget. It's a great article. It does get very technical at one point, but the researchers did a great job explaining the top level problem and the potential outcome of that problem in a way that I think anyone could find accessible. When you get to the actual analysis part, that's when it gets really technical.

But the summary, the conclusions, all of that, I think is easy to understand. So in that paper, the researchers say, quote, we discover that learning from data produced by other models causes model collapse, a degenerative process whereby over time models forget the true underlying data distribution end quote. So essentially, these AI models will quote unquote forget information, while simultaneously a set of learned behaviors they have created through synthesizing

all this information will begin to converge and lead to a broken model that's no longer really useful. It won't present anything that's of real value. So the researchers argue that quote the use of LLMs at scale to publish content on the Internet will pollute the collection of data to train them end quote. That's bad news, and it's definitely going to be an issue, particularly with sites that fall into the content farm category, because it's already happening, right?
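(A toy simulation, not the paper's actual method, shows that "forgetting the true underlying data distribution" effect in a few lines of Python. Each generation samples its training data from the previous generation's output, and rare kinds of content that drop out never come back:)

import numpy as np

rng = np.random.default_rng(0)

# "Human" data: ten distinct kinds of content, some common, some rare.
kinds = np.arange(10)
weights = np.linspace(5, 1, 10)
data = rng.choice(kinds, size=100, p=weights / weights.sum())

for generation in range(1, 101):
    # Each generation "trains" on the previous generation's output, then produces
    # new content by sampling from what it saw (resampling with replacement).
    data = rng.choice(data, size=100, replace=True)
    if generation % 20 == 0:
        print(f"generation {generation:3d}: distinct kinds of content left = {len(np.unique(data))}")

# Once a rare kind fails to appear in one generation's output, no later generation
# can ever produce it again, so variety only shrinks: a toy version of the
# tail-forgetting the model collapse paper describes.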

There are already websites out there that have turned to AI generation to flesh out the articles that they have in their database, and these articles are of varying quality, and all of those are getting scooped up in a future AI model training session and used side by side with articles that were researched, written, and edited by human beings and are therefore, potentially at least, of higher quality. I'm not saying that all human-written and edited articles are great.

They're not. There's some bad stuff out there that human beings have written. But with those steps in place, you have the potential for really great work. With AI, you don't necessarily get that. You hope you get it, but there's no guarantee and there aren't enough, I would say, safety valves to make sure that things don't go off the rails. So getting back to content farms, if you are unfamiliar with that term, well don't worry. You've almost certainly come across a content farm at some point in

the past. So these are sites that just churn out an enormous amount of content, typically in an effort to tap into the sweet sweet waters of SEO, which stands for search engine optimization. So for a lot of websites out there, the majority of traffic coming to the website comes courtesy of a search engine. And when I say a search engine, you might as well fill in the name Google there, because that's the big one. I mean.

There are other search engines out there, and some of them do contribute to this too, but Google commands somewhere between eighty and like ninety five percent of the search market. Exactly where that falls is a matter of debate. Like, I looked at a few different Internet analytics sites, right, and they had different percentages, but it was always above eighty percent, and some as high as like ninety two or ninety three. So it's safe to say that Google

dominates the search space. You know, technically it may not be a monopoly, but effectively it kind of is. So sites that depend on traffic from search naturally want to find ways for their pages to rank high in search results and to appear in more search results. Now that's actually easier said than done. Google has changed its page ranking algorithm a few different times, and some search

results are dependent upon who is doing the searching. That means that you and I might each search for the exact same thing, maybe we word it the exact same way, but we'll end up getting different results. Google says, quote personalization is only used in your results if it can provide more relevant and helpful information end quote. So presumably

it doesn't happen all the time. That means that in some cases you and I will get identical results depending upon what it is we're searching for, and in other cases we will get very different search results. I do know this makes SEO a much larger challenge because it's impossible to be all things to all people. You know, you can only do the best you can to try

and show up for any given search query. It is super duper hard if you're dependent upon human writers and editors to generate all the stuff that you're shoving out in an effort to get clicks. So most of your traffic is coming from search. We talked about this already. You need to have lots of stuff on your site that people could be searching for so that traffic comes your way, and that way you can make money through

web advertising. Essentially, you could try to be reactionary, right? You could try to generate new content as things capture the public interest, but you run the danger of getting to the party too late, and that by the time you have something up, no one's talking about it anymore and you're not really seeing any real traffic from that. What if instead you could just kind of open up

a fire hose of content using generative AI? Well, if you just had AI write a whole bunch of articles in the style that you've established for your company, and maybe, if you're feeling a little cautious, you'll even employ a couple of human editors to take on the job of reading over these generated articles and to correct any mistakes that were made, and perhaps even tweak a couple of things here and there to make it sound more

human if necessary. But now you can push out way more content without having to wait on human writers to research and write everything. Plus, AI does not complain if you assign it to write a suite of articles about gluten-free skincare products. By the way, I'm using my real-life experience with that last example. I once got that writing assignment. It was dumb then and it's dumb now, but I guess people were searching for it,

so I got an assignment to write it. Now, I would like to think that the site I was writing for, which was HowStuffWorks dot com, wasn't really a content farm. I would love to think that, and I would argue that for many years when I wrote there, it did not qualify as a content farm. We did try to write in-depth, authoritative articles about all sorts of stuff, whether we were talking about technology or society, or

money or entertainment, whatever it might be. We applied rigor, you know, journalistic rigor, toward the research and writing and editing of those pieces. Over time, things changed, where we started to cater more toward ad deals, where we would get this big ad deal with a company, like a, you know, cosmetics company, for example, and we would suddenly have hundreds of articles assigned in the field of cosmetics, articles that were incredibly niche, like there was no way

that any one of them was going to drive a ton of traffic. But collectively these articles could get a lot of traffic. Not a single one, but across the board. If someone happened to be searching for this thing, they could find their way to our article and that would be another click coming our way. It was very much a shotgun approach to writing content. I hated it. There were articles I wrote that I am not at all, well, it's not that I'm not proud of the work I did.

I'm not proud of getting the assignment, like it was a joke in my opinion, but that's what we were trying to do in order to survive. Because again, HowStuffWorks was like one of these websites in that most of the traffic coming through HowStuffWorks came through a search engine. Someone was looking to learn how something worked and they got sent our way. People weren't, as a rule, just coming to HowStuffWorks to

peruse the website. We always wanted that, that was what our goal was, to create a destination website that people would want to go to just to see, oh, what's new on the site. But we never really achieved that. It's a really hard thing to do. There are people who do it and it's amazing, but it's not easy to replicate. So instead we wrote tons of articles about stuff that people were searching for, and that just kind

of was our MO at that point. Anyway, if you're using AI to create these kinds of articles, it's going to generate a lot of stuff that's just not very good. But then, who cares? Like, you don't necessarily care if the material is good. If the only traffic you're really getting on your website is coming from search engines, you

just need it to show up in the search engines. Now, if the search engine is able to determine, hey, this is low quality content, and it disincentivizes people visiting by making it go further down the search results, then you're going to have a problem, and a lot of content farms ran into that problem. Google downgraded content farms in

their search algorithm. Other sites like DuckDuckGo removed websites that were considered content farms because the people running DuckDuckGo realized, hey, these sites aren't providing anything of real value to visitors. Why are we even serving it up? That's not really a good use of anyone's time. But if you're in a space where the jig isn't up yet, you might as well just go ahead and create as much garbage as you can because you just

want the clicks. You don't care if people actually think the articles are of good quality or that they're going to learn anything useful. You don't even necessarily care if the articles are accurate. You care that people are clicking

on the articles. So if that's your perspective, ultimately, then the goal for you is to push as much of this stuff out the door as you possibly can, generate it as fast as possible, get it online as quickly as you can, and hope that it starts to rank in search so that people flood in to read about whatever it is you're writing about. But it's not just people who are going to your links, is it? There are

bots crawling the web now. Some of them are crawling the web in order to index those web pages for the purposes of things like search engines, but other bots are there to scrape data for the purposes of training the next generation of large language models. Essentially, at this point, bots are reading articles that were written by other bots, and so when the next large language model launches, it does so on a data set that has been polluted

by bot generated information. That means the next generation will be even worse, and so on, and eventually we arrive at a point where the Internet, this amazing invention that provides access to practically all of human knowledge, becomes absolutely infested with junk that is inaccurate and increasingly nonsensical, and we render this incredible invention useless. This isn't just speculation either. We have examples of companies turning to AI to generate articles.

CNET famously did this early in the days of generative AI, and CNET properly got roasted for doing it, first roasted for not presenting it in a way that was transparent, and then also for including articles that just had outright wrong information in them and publishing them as if they were vetted pieces that editors had gone through. HowStuffWorks, again, my old employer where I got that skincare writing assignment once upon a time, they've done

this too. They laid off their human writers, they stopped giving assignments to freelancers. Later on they laid off the entire editorial staff after the editors protested this move toward AI generated content. This trend is happening. Not only are talented people being put out of work, which is bad enough already. These editors and writers, they believed in what

they were doing. Yeah, sometimes the assignments stank, sometimes they were not good, but the writers and editors still believed in doing as good a job as they possibly could. But their replacements, the AI, they're just making the Internet worse by generating unreliable and terrible content that no one actually wants to read unless they just happen to put that particular set of terms into a search engine and

the search engine couldn't find anything better to serve them. Again, it's as if you needed to learn something important, but all you have access to are just sloppily written articles by people who had no understanding of or passion about the subject matter they were writing on, and there were no editors to steer the writer toward creating a more accurate or informative piece. It gets pretty darn bleak. Is it inevitable, though? No,

it's not inevitable. This future happens if the people who are training the AI models allow it to happen. But with careful stewardship, by guiding the AI models so that they don't pull training data from garbage sites and they really focus on reputable sources, it's possible to avoid these issues, at least in some part. I mean, some things like hallucinations, confabulations, that kind of stuff can happen anyway,

but you can at least limit it. That's not really what we're seeing right now, though, because at the moment, companies are rushing into the AI space, right? They are pushing so hard to create large language models that dwarf the previous generation's capabilities. So to do that, they have to seek out training data from all across the internet. You have to train these AI models on tons and tons of information to make them useful. The more data

you have access to, the better. Social platforms have provided a popular source of information. We know that Reddit has struck deals with OpenAI, for example, in order to allow crawling Reddit to pull information. But you know what, social platforms are also really popular with bots, not just with people. So even this approach brings with it the risk of AI training on other AI-generated data, which again leads

to model collapse further down the road. I might one day do a much more in-depth episode about this paper, The Curse of Recursion: Training on Generated Data Makes Models Forget. I've given a very high level summary of what the researchers say in that paper, but it might benefit us to take a much closer look at what they found and their conclusions. So I may revisit this topic in the future, but for now, I think it's just good to remember that AI does have the potential to do

great things. I mean, it can potentially augment our work efforts and let us accomplish goals more quickly and efficiently and accurately. But AI also has the potential to make things miserable and churn out content that no one wants to see other than other bots, creating a cynical cycle that ultimately could make the Internet into a cluttered, practically useless mess. So which way are we going to go?

I think my answer day to day depends on how optimistic I feel, but at the very least, I think knowing about the risks is important. That's it for today's episode. I hope you are all well. I will try to get away from AI topics. I know I've been covering a lot of that recently, and it'd be nice to kind of branch into other areas of tech, so I'm going to try and do that. It's just, AI stuff just keeps on happening, y'all. But I will talk to

you again really soon. TechStuff is an iHeartRadio production. For more podcasts from iHeartRadio, visit the iHeartRadio app, Apple Podcasts, or wherever you listen to your favorite shows.

Transcript source: Provided by creator in RSS feed.