Welcome to Tech Stuff, a production from I Heart Radio. Hey there, and welcome to tech Stuff. I'm your host, Jonathan Strickland. I'm an executive producer with I Heart Radio and I love all things tech. Now, before I get into today's episode, I want to give a little listener warning here. The topic at hand involves some adult content, including the use of technology to do stuff that can
be unethical, illegal, hurtful, and just plain awful. Now, I think this is an important topic, but I wanted to give a bit of a heads up at the start of the episode, just in case any of you guys are listening to the podcast on, like, a family road trip or something. I think everyone should know about this and think about it, but I also respect that for some people this subject might be a bit taboo. So let's go
on with the episode. Back in nineteen ninety-three, a movie called Rising Sun, directed by Philip Kaufman, based on a Michael Crichton novel and starring Wesley Snipes and Sean Connery, came out in theaters. Now, I didn't see it in theaters, but I did catch it when it came on, you know, HBO or Cinemax or something later on. The movie included a sequence that I found to be totally unbelievable. And I'm not talking about buying into Sean Connery being an
expert on Japanese culture and business practices. Actually, side note, Sean Connery has an interesting history of playing unlikely characters, such as in Highlander, where he played an immortal who was supposedly Egyptian, who then lived in feudal Japan and ended up in Spain, where he became known as Ramirez. And all the while he's talking to a Scottish Highlander who's played by a French actor. But I'm getting way
off track here. Besides, I've heard Crichton actually wrote the character while thinking of Connery, so you know, what the heck do I know? In the film, Snipes and Connery are investigators, and they're looking into a homicide that happened at a Japanese business but on American soil. The security system in the building captured video of the homicide, and the identity of the killer appears to be a pretty open and shut case. But that's not how it all
turns out. The investigators talk to a security expert played by Tia Carrere, and she demonstrates in real time how video footage can be altered. She records a short video of Connery and Snipes, loads that onto a computer, freezes a frame of the video, and essentially performs a cut and paste job swapping the heads of our two lead characters. Then she resumes the video and the head swap remains
in place, and that head swap stuff is possible. I mean, clearly it has to be possible, because you actually do see that effect in the film itself. But it takes a bit more than a quick cut and paste job. But we'll leave off of that for now. The whole point of that sequence, apart from showing off some cinema magic, is to demonstrate to the investigators that video, like photographs, can be altered. The expert has detected a blue halo around the face of the supposed murderer in the footage,
indicating that some sort of trickery has happened. She also reveals that she cannot magically restore the video to its previous unaltered state, which I think was actually a nice change of pace for a movie. By the way, I think this movie is really, you know, not good, like not worth your time, but that's my opinion anyway. For years, this kind of video sorcery was pretty much limited to
the film and TV industries. It usually required a lot of pre-planning, so it wasn't as simple as just taking footage that was already shot and changing it in post on a whim with a couple of clicks of a button. If it were, we would see a lot fewer mistakes left in movies and television, because you could catch it later and just fix it. But the tricks were possible, they were just difficult to pull off. It just wasn't something you or I would ever encounter
in our day to day lives. But today we live in a different world, a world that has examples of synthetic media commonly referred to as deep fakes. These are videos that have been altered or generated so that the subject of the video is doing something that they probably would or could never do. They've brought into question whether or not video evidence is even reliable, much as the film Rising Sun was talking about. We already know that
eyewitness testimony is terribly unreliable. Our perception and memory play tricks on us, and we can quote unquote remember stuff that just didn't happen the way things actually unfolded in reality. But now we're looking at video evidence in potentially the
same light. I mean, it's scary. So today we're going to learn about synthetic media, how it can be generated, the implications that follow from that sort of reality, and ways that people are trying to counteract a potentially dangerous threat. You know, fun stuff. Now, first, the term synthetic media has a particular meaning. It refers to art created through some sort of automated process, so it's a largely hands
off approach to creating the final art piece. Now, under that definition, the example of Rising Sun would not apply here, because we see in the film, and presumably this happens in the book as well, but I haven't read the book, that a human being actually makes the changes. A person has used tools to alter the video footage. This would be more like using Photoshop to touch up a still image, with the computer system presumably doing some of the work
in the background to keep things matched up. Either that, or you would need to alter each image in the footage frame by frame, or use some sort of matte approach. To learn more about mattes, you can listen to my episode about how blue and green screens work. Synthetic media as a general practice has been around for centuries. Artists have set up various contraptions to create works with little or no human guidance. In the twentieth century we started
to see a movement called generative art take form. This type of art is all about creating a system that then creates or generates the finished art piece. That would mean that the finished work, such as a painting, wouldn't reflect the feelings or thoughts of the artist who created the system. In fact, it starts to raise the question, what is the art? Is it the painting that came about due to a machine following a program of
some sort, or is the art the program itself? Is the art the process by which the painting was made? Now I'm not here to answer that question. I just think it is an interesting question to ask. Sometimes people ask much less polite questions, such as, is it art at all? Some art critics went out of their way to dismiss generative art in the early days. They found it insulting, but hey, that's kind of the history of art in general. Each new movement in art inevitably finds
both supporters and critics as it emerges. If anything, you might argue that such a response legitimizes the movement in, you know, a weird way. If people hate it, it must be something. In two thousand eighteen, an artist collective called Obvious, located out of Paris, France, submitted portrait style paintings that were created not by an actual human painter, but by an artificially intelligent system. Now they looked a
lot like typical eighteenth century style portraits. There was no attempt to pass off the portrait as if it were actually made by a human artist. In fact, the appeal of the piece was largely due to it being synthetically generated. It went to auction at Christie's and the AI created painting fetched more than four hundred thousand dollars. And the way the group trained their AI is relevant to our
discussion about deep fakes. The collective relied on a type of machine learning called generative adversarial networks, or GANs, which in turn depend on deep learning. So it looks like we've got a few things we're going to have to define here. Now, I'm going to keep things fairly high level, because as it turns out, there are a few different ways to create machine learning models, and to go through all of them in exhaustive detail would
represent a university level course in machine learning. I have neither the time for that nor the expertise. I would do a terrible job. So we'll go with a high level perspective here. First, a generative adversarial network uses two systems. You have a generator and you have a discriminator. Both of these systems are a type of neural network. A neural network is a computing model that is inspired by
the way our brains work. Our brains contain billions of neurons, and these neurons work together, communicating through electrical and chemical signals, controlling and coordinating pretty much everything in our bodies. With computers, the neurons are nodes. The job of a node is, you know, supposed to be kind of like a neuron cell in the brain: it's to take in multiple weighted
input values and then generate a single output value. Now, the word weighted, W E I G H T E D, weighted, is really important here, because the larger an input's weight, the more that input will have an effect on whatever the output is. So it kind of comes down to which inputs are the most important for that node's particular function. Now, if I were to make an analogy, I would say, your boss hands you three tasks to do.
One of those tasks has the label extremely important, and the second task has the label critically important, and the third task has a label saying you should have finished that one before it was handed to you. Okay, so that's just some sort of snarky office humor that I needed to get off my chest. But more seriously, imagine a node accepting three inputs. In this example, input one has a fifty percent weight, input two has a forty percent weight, and
input three has a ten percent weight. That adds up to one hundred percent, and that would tell you that the output that node generates will be most affected by input one, followed by input two, and then input three would have a smaller effect on whatever the output is. Each node applies a nonlinear transformation on the input values, again affected by each input's weight value, and that generates the output value.
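To make that concrete, here's a minimal sketch of a single node in Python. The sigmoid activation and the bias term are my own choices for illustration; the episode doesn't name a specific nonlinear transformation.

```python
import math

def node_output(inputs, weights, bias=0.0):
    # A node takes in several weighted inputs, adds them up, and pushes
    # the total through a nonlinear transformation (a sigmoid here).
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))  # squashes the result into (0, 1)

# Mirrors the example above: input one (50 percent) sways the output the
# most, and input three (10 percent) the least.
print(node_output([0.9, 0.2, 0.7], [0.5, 0.4, 0.1]))
```

Stack rows of these nodes so that one layer's outputs become the next layer's inputs, and you've got the layered network described next.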
The mathematical details of that transformation really are not important for our episode. They involve performing changes on variables that in turn change the correlation between variables, and it gets a bit mathy, and we would get lost in the weeds pretty quickly. The important thing to remember is that a node within a neural network takes in a weighted sum of inputs, then performs a process on those inputs before passing the result
on as an output. Then some other node a layer down will accept that output, along with outputs from a couple of other nodes one layer up, and then will perform an operation based on those weighted inputs and pass that on to the next layer, and so on. So these nodes are in layers, like, you know, a cake. One layer of nodes processes some inputs, they send it on to the next layer of nodes, and then that one sends it on to the next one, and the next one
and so on. This isn't a new idea. Computer scientists began theorizing and experimenting with neural network approaches as far back as the nineteen fifties with the perceptron, a system described by Frank Rosenblatt of Cornell University. But it wasn't until the last decade that computing power and our ability to handle a lot of data reached a point where these sorts of learning models
could really take off. The goal of this system is to train it to perform a particular task within a certain level of precision. The weights I mentioned are adjustable, so you can think of it as teaching a system which bits are the most important in order to do whatever it is the system is supposed to do. To achieve your task, these are the bits that are the most important and therefore should matter the most
when you weigh a decision. This is a bit easier if we talk about a similar system, the version of IBM's Watson that played on Jeopardy. That system famously was not connected to the Internet. It had to rely on all the information that was stored within itself. When the system encountered a clue in Jeopardy, it would analyze the clue, and then it would reference its database to look for possible answers to whatever that clue was.
The system would weigh those possible answers and attempt to determine which, if any, were the most likely to be correct. If the certainty was over a certain threshold, as in the system was sufficiently sure, it would buzz in with its answer. If no response rose above that threshold, the system would not buzz in. So you could say that Watson was playing the game with a best guess sort of approach.
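Boiled down to code, that buzz-or-stay-quiet logic might look something like this sketch; the candidate answers and the seventy percent cutoff are invented for illustration.

```python
def should_buzz(candidates, threshold=0.7):
    # Pick the candidate answer with the highest confidence score,
    # and only buzz in if that score clears the threshold.
    best_answer, best_score = max(candidates.items(), key=lambda kv: kv[1])
    if best_score >= threshold:
        return best_answer, best_score  # confident enough: buzz in
    return None                         # otherwise stay quiet

print(should_buzz({"Toronto": 0.31, "Chicago": 0.84}))  # ('Chicago', 0.84)
print(should_buzz({"Toronto": 0.41, "Chicago": 0.39}))  # None
```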
Neural networks do essentially that sort of processing. With this particular type of approach, we know what we want the outcome to be, so we can judge whether or not the system was successful. After each attempt, we can adjust the weights on the inputs between nodes to refine the decision making process and get more accurate results. If the system succeeds in its task, we can increase the weights that contributed to the system picking the correct answer and decrease the weights of inputs that did not contribute to the successful response. If the system done messed up and gave the wrong answer, then we do the opposite. We look at the inputs that contributed to the wrong answer, we diminish their weights, and we increase the weights of the other inputs, and then we run the test again, a lot. I'll explain a bit more about this process when we come back, but
first let's take a quick break. Early in the history of neural networks, computer scientists were hitting some pretty hard stops due to the limitations of computing power at the time. Early networks were only a couple of layers deep, which really meant they weren't terribly powerful, and they could only tackle rudimentary tasks, like figuring out whether or not a square is drawn on a piece of paper. That isn't
terribly sophisticated. In nineteen eighty-six, David Rumelhart, Geoffrey Hinton, and Ronald Williams published a paper titled Learning Representations by Back-Propagating Errors. This was a big breakthrough for deep learning. This all has to do with a deep learning system improving its
ability to complete a specific task. And basically the algorithm's job is to go from the output layer, you know, where the system has made a decision, and then work backward through the neural network, adjusting the weights that led to an incorrect decision. So let's say it's a system that is looking to figure out whether or not a cat is in a photograph and it says, there's a cat in this picture, and you look at the picture
and there is no cat there. Then you would look at the inputs one level back, just before the system said here's a picture of a cat, and you'd say, all right, which of these inputs led the system to believe this was a picture of a cat? And then you would adjust those. Then you would go back one layer up, so you're working your way up the model, and ask which inputs here led to it giving the outputs that led to the mistake, and you do this all the way up until you get up to the
input level at the top of the computer model. You are back propagating, and then you run the test again to see if you've got improvement. It's exhaustive, but it's also drastically improved neural network performance, much faster than just throwing more brute force at it. The algorithm essentially is checking to see if a small change in each input value received by a layer of nodes would have led to a more accurate result. So it's all about going from that output and working your way backward.
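Here's a toy, single-node version of that idea: nudge each weight in proportion to how much its input contributed to the error. A real backpropagating network chains this kind of correction through every layer, but the flavor is the same; the features, labels, and learning rate here are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

weights = [0.1, -0.2]
examples = [([1.0, 0.0], 1.0),   # features -> label, where 1.0 means "cat"
            ([0.0, 1.0], 0.0)]   # and 0.0 means "no cat"

for _ in range(1000):
    for inputs, label in examples:
        output = sigmoid(sum(x * w for x, w in zip(inputs, weights)))
        error = output - label             # how wrong was the output?
        for i, x in enumerate(inputs):     # push the blame back to each input:
            weights[i] -= 0.5 * error * x  # weights that misled us shrink,
                                           # weights that helped us grow

print(weights)  # the first weight has grown, the second has dropped
```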
In two thousand twelve, Alex Krizhevsky published a paper that gave us the next big breakthrough. He argued that a really deep neural network with a lot of layers could give really great results if you paired it with enough data to train the system. So you needed to throw lots of data at these models, and it needed to be an enormous amount of data. However, once trained, the system would produce lower error rates. So yeah, it would take a long time, but you would get better results. Now, at the time, a good error rate for such a system was around twenty-five percent. That means one out of four conclusions the system would come to would be wrong. If you ran it across a long enough number of decisions, you would find that one out of every four wasn't right. The system that Alex's team worked on produced results that had an error rate of about sixteen percent, so much lower.
And then in just five years, with more improvements to this process, the classification error rate had dropped down to two point three percent for deep learning systems. So from twenty-five percent down to two point three, it was really powerful stuff. Okay, so you've got your artificial neural network. You've got your layers and layers of nodes. You've adjusted the weights of the inputs into each node to see if your system can identify, you know, pictures of cats, and you start
feeding images to this system, lots of them. This is the domain that you are feeding to your system. The
more images you can feed to it, the better. And you want a wide variety of images of all sorts of stuff, not just of different types of cats, but stuff that most certainly is not a cat, like dogs or cars or chartered public accountants, you name it. And you look to see which images the system identifies correctly and which ones it screws up, both images the system says have cats in them that actually don't have cats in them, and images the system has identified as saying there is
no cat here, but there is a cat there. This guides you into adjusting the weights again and again, and you start over and you do it again, and that's your basic deep learning system, and it gets better over time as you train it. It learns. Now, let's transition over to the adversarial systems I mentioned earlier, because they take this and twist it a little bit. So you've got two artificial neural networks and they are using this general approach to deep learning, and you're setting them up
so that they feed into each other. One network, the generator, has the task of learning how to do something, such as create an eighteenth century style portrait, based off lots and lots of examples of the real thing, the problem domain. The second network, the discriminator, has a different job. It has to tell the difference between authentic portraits that came from the problem domain and computer
generated portraits that came from the generator itself. So essentially the discriminator is like the model I mentioned earlier that was identifying pictures of cats. It's doing the same sort of thing, except instead of saying cat or no cat, it's saying real portrait or computer generated portrait. So there are essentially two outcomes the discriminator could reach, and that's whether an image is computer generated or it isn't. So do you see where this is going? You train
up both models. You have the generator attempt to make its own version of something, such as that eighteenth century portrait. It does so by designing the portrait based on what the model believes are the key elements of a portrait, so things like colors, shapes, the ratios of sizes, like, you know, how large should the head be in relation to the body. All of these factors and many more
come into play. The generator creates its own idea of what a portrait is supposed to look like, and chances are the early rounds of this will not be terribly convincing. The results are then fed to the discriminator, which tries to suss out which of the images fed to it
are computer generated and which ones aren't. After that round, both models are tweaked: the generator adjusts input weights to get closer to the genuine article, and the discriminator adjusts weights to reduce false positives and better catch computer generated images. And then you go again and again and again and again, and they both get better over time.
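For the curious, here's what that loop looks like as a tiny sketch in Python, using the PyTorch library and a one-dimensional stand-in for portraits. The network sizes and training settings are arbitrary choices of mine, not anything from the Obvious project.

```python
import torch
import torch.nn as nn

real_data = lambda n: torch.randn(n, 1) * 1.5 + 4.0  # the "problem domain"
noise     = lambda n: torch.randn(n, 8)              # raw input for the generator

generator = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
discriminator = nn.Sequential(nn.Linear(1, 16), nn.ReLU(),
                              nn.Linear(16, 1), nn.Sigmoid())

loss = nn.BCELoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)

for step in range(2000):
    # Train the discriminator: real samples are labeled 1, fakes 0.
    real, fake = real_data(64), generator(noise(64)).detach()
    d_loss = (loss(discriminator(real), torch.ones(64, 1)) +
              loss(discriminator(fake), torch.zeros(64, 1)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Train the generator: try to make the discriminator answer "real."
    fake = generator(noise(64))
    g_loss = loss(discriminator(fake), torch.ones(64, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()

print(generator(noise(1000)).mean().item())  # drifts toward the real mean, 4.0
```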
So, assuming everything is working properly, over time the adjustment of input weights will lead to more convincing results, and given enough time and enough repetition, you'll end up with a computer generated painting that you can auction off for nearly half a million dollars. Though keep in mind that huge price tag dates back to the novelty of it being an early AI generated painting. It would be shocking to me if we saw that actually become a trend. Also, the painting, while interesting, isn't exactly so astounding as to make you think there's no way a machine did that. You'd look at it and go, yeah, I can imagine a machine did that one. A group of computer scientists first described the generative adversarial network architecture in a paper in two thousand fourteen, and like other neural networks, these models
require a lot of data, the more the better. In fact, smaller data sets mean the models have to make some pretty big assumptions, and you tend to get pretty lousy results. More data, as in more examples, teaches the models more about the parameters of the domain, whatever it is they are trying to generate. It refines the approach. So if you have a sophisticated enough pair of models and you have enough data to fill up a domain, you can
generate some convincing material. And that includes video, and this brings us around to deep fakes. In addition to generative adversarial networks, a couple of other things really converged to create the techniques and trends and technology that would allow for deep fakes proper. In nineteen ninety-seven, Malcolm Slaney, Michele Covell, and Christoph Bregler wrote some software that they called the
Video Rewrite Program. The software would analyze faces and then create or synthesize lip animation which could be matched to pre recorded audio. So you could take some film footage of a person and then reanimate their lips so that they could appear to say all sorts of things, which
in some ways set the stage for deep fakes. In this case, it was really just focusing on the lips and the general area around the lips, so you weren't changing the rest of the expression of the face, and you would have to, you know, keep your recording about the same length as whatever the film clip was, or you would have to loop the film clip over and over, which would make it, you know, far more obvious that
this was a fake. In addition, motion tracking technology was advancing over time too, and this also became an important tool in computer animation. This tool would also be used by deep fake algorithms to create facial expressions, manipulating the digital image just as it would if it were a video game character or a Pixar animated character. Typically, you need to start with some existing video in order to
manipulate it. You're not actually computer generating the animation, like, you're not creating a computer generated version of whomever it is you're doing the fake of. You're using existing imagery in order to do that and then manipulating that existing imagery, so it's a little different from computer animation.
In two thousand sixteen, students and faculty at the Technical University of Munich created the Face2Face project, that would be face, the numeral two, and then face, and this was particularly jaw dropping to me at the time. When I first saw these videos, I was floored. They created a system that had a target actor. This would be the video of the person that you want to manipulate. In the example they used, it was former US President George W. Bush. Their process also had
a source actor. This was the source of the expressions and facial movements you would see in the target, so kind of like a digital puppeteer in a way. But the way they did it was really cool. They had a camera trained on the source actor, and it would track specific points of movement on the source actor's face, and then the system would manipulate the same points of
movement on the target actor's face in the video. So if the source actor smiled, then the target smiled, so the source actor would smile, and then you would see George W. Bush in the video smile in real time. It was really strange. They used this looping video of
George W. Bush wearing a neutral expression. They had to start with that as their sort of zero point. And I gotta tell you, it really does look like former president George W. Bush is having a bit of a freak out on a looping video, because he keeps on opening his mouth, closing his mouth, grimacing, raising his eyebrows. You need to watch this video; it is still available online to check out.
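To give a flavor of the bookkeeping behind this kind of puppeteering, here's a heavily simplified sketch: take how far each tracked point on the source actor's face has moved from its neutral position and apply the same offset to the matching point on the target's neutral face. The landmark names and coordinates are invented, and the real Face2Face system does far more than this (3D face models, lighting, blending), so treat it as illustration only.

```python
# Landmark positions (x, y) for each face at rest.
NEUTRAL_SOURCE = {"mouth_left": (120, 200), "mouth_right": (180, 200)}
NEUTRAL_TARGET = {"mouth_left": (310, 415), "mouth_right": (372, 415)}

def transfer_expression(current_source):
    # Map the source actor's landmark movement onto the target's face.
    target = {}
    for name, (sx, sy) in current_source.items():
        nx, ny = NEUTRAL_SOURCE[name]   # where the source point sits at rest
        tx, ty = NEUTRAL_TARGET[name]   # where the target point sits at rest
        target[name] = (tx + (sx - nx), ty + (sy - ny))  # apply the offset
    return target

# The source actor smiles: mouth corners spread outward and upward.
print(transfer_expression({"mouth_left": (114, 195), "mouth_right": (186, 195)}))
```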
In two thousand seventeen, students and faculty over at the University of Washington created the Synthesizing Obama project, in which they trained a computer model to generate a synthetic video of former US President Barack Obama, and they made it lip sync to a pre recorded audio clip from one of Obama's addresses to the nation. They actually had the original video of that address for comparison, so they could look back at that and see how their
generated one compared to the real thing. Their approach used a model that analyzed hundreds of hours of video footage of Obama speaking, and it mapped specific mouth shapes to specific sounds. It would also include some of Obama's mannerisms, such as how he moves his head when he talks or uses facial expressions to emphasize words. And watching the video, you know, the real one next to the generated one, is pretty strange. You can tell the generated one isn't quite right. It's not matching the audio exactly, at least not in the early versions, but it's fairly close, and it might even pass casual inspection for a lot of people who weren't, like, you know, actually paying attention.
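Reduced to a cartoon of the idea, the sound-to-mouth-shape mapping is a lookup from phonemes (speech sounds) to visemes (mouth poses), applied along the audio. The table entries below are invented for illustration; the actual project learned this mapping from hours of footage rather than hand-coding it.

```python
# A hand-made phoneme-to-viseme table (illustrative, not from the project).
VISEMES = {
    "AA": "jaw_open", "M": "lips_closed", "F": "lip_bite",
    "OW": "lips_rounded", "S": "teeth_together",
}

def mouth_track(phonemes, default="neutral"):
    # Turn a sequence of phonemes into a sequence of mouth shapes.
    return [VISEMES.get(p, default) for p in phonemes]

print(mouth_track(["M", "AA", "F", "OW"]))
# ['lips_closed', 'jaw_open', 'lip_bite', 'lips_rounded']
```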
Authors Maras and Alexandrou defined deep fakes as, quote, the product of artificial intelligence applications that merge, combine, replace, and superimpose images and video clips to create fake videos that appear authentic, end quote. They first emerged in two thousand seventeen, and so this is a pretty darn young application of technology. One thing that is worrisome is that once someone has access to the tools, it's not that difficult to create
a deep fake video. You pretty much just need a decent computer, the tools, a bit of know how on how to do it, and some time. You also need some reference material, as in, like, videos and images of the person that you are replicating, and like the machine learning systems I've mentioned, the more reference material you have, the better. That's why the deep fakes you encounter these days tend to be of notable famous people like celebrities
and politicians, mainly. There's no shortage of reference material for those types of individuals, and so they are easier to replicate with deep fakes than someone who maintains a much lower profile. That's not to say that will always be the case, or that there aren't systems out there that can accept smaller amounts of reference material. It's just harder to make a convincing version with fewer samples. But in order to make a convincing fake, the system really has
to learn how a person moves. All those facial expressions matter. It also has to learn how a person sounds. We'll get into sound a little bit later. But mannerisms, inflection, accent, emphasis, cadence, quirks and tics, all of these things have to be analyzed and replicated to make a convincing fake, and it has to be done just right, or else it comes off as creepy or unrealistic. Think about how impressionists will take a celebrity's manner of speech and then heighten some
of it for comedic effect. You'll hear it all the time with folks who do impressions of people like Jack Nicholson or Christopher Walken or Barbra Streisand, people who have a very particular way of speaking. Impressionists will take those as markers and they really punch in on them. Well, a deep fake can't really do that too much, or else it won't come across as genuine. It'll feel like you're watching a famous person impersonating themselves, which is weird. Now.
The earliest mention of deep fakes I can find dates to a two thousand seventeen Reddit forum in which a user shared deep faked videos that appeared to show female celebrities in sexual situations. Heads and faces had been replaced, and the actors in pornographic movies had their heads or
faces swapped out for these various celebrities. Now the fakes can look fairly convincing, extremely convincing in some cases, which can lead to some people assuming that the videos are genuine and that the folks that they saw in the videos are really the ones who are in it. And
obviously that's a real problem, right. I mean that with this technology, given enough reference data to feed a system, someone could fabricate a video that appears to put a person in a compromising position, whether it's a sexual act or making damaging statements or committing a crime or whatever. And there are tools right now that allow you to do pretty much what the Face2Face tool was doing back in two thousand sixteen. A program called Avatarify,
which is not that easy to say anyway. It can run on top of live streaming conference services like Zoom and Skype, and you can swap out your face for a celebrity's face. Your facial expressions map to the computer manipulated celebrity face. It just looks at you through your webcam, and then if you smile, the celebrity image smiles, etcetera.
It's like that old Face2Face program. It does need a pretty beefy PC to manage doing all this, because you're also running that live streaming service underneath it. It's also not exactly user friendly. You need some programming experience to really get it to work. But it is widely accessible, as the source code is open source and it's on GitHub, so anyone can get it.
Samantha Cole, who writes for Vice, has covered the topic of deep fakes pretty extensively, and the potential harm they can cause, and I recommend you check out her work if you're interested in learning more about that. Do be warned that Cole covers some pretty adult themed topics. I think she does great work and very important work, but as a guy who grew up in the Deep South, it's also the kind of stuff that occasionally makes me clutch my pearls. But that's more of a statement
about me than her work. She does great work. I think most of us can imagine plenty of scenarios in which this sort of technology could cause mischief on a good day and catastrophe on a bad day, whether it's spreading misinformation, creating fear, uncertainty, and doubt, FUD, or making people seem to say things they never actually said, or contributing to an ugly subculture in which people try to make their more base fantasies a reality by putting
one person's head on another person's body. You know, it's not great. There are legitimate uses of the technology too, of course. You know, tech itself is rarely good or bad. It's all in how we use it. But this particular technology has a lot of potentially harmful uses, and Samantha Cole has done a great job explaining them. When we come back, I'll talk a bit more about the war against deep fakes and how people are trying to prepare for a world that is increasingly filled with media we
can't really trust. But first, let's take a quick break. Before the break, I mentioned Samantha Cole, who has written extensively about deep fakes, and one point she makes that I think is important for us to note is that the vast majority of instances of deep fake videos haven't been some manufactured video of a political leader saying inflammatory things.
That continues to be a big concern. There's a genuine fear that someone is going to manufacture a video in which a politician appears to say or do something truly terrible in an effort to either discredit the politician or perhaps instigate a conflict with some other group. There are literal doomsday scenarios in which such a video would prompt a massive military response, though it does seem like it
might be a little far fetched. Though heck, I don't know, considering the world we live in, maybe it's not that big of a stretch. Anyway, Cole's point is that so far, that has not happened. She points out that the most frequent use for the tech tends to be either people goofing around or, disturbingly, using it to, in her words, quote, take ownership of women's bodies in non consensual porn
end quote. Cole argues that the reason we haven't really seen deep fakes used much outside of these realms, apart from a few advertising campaigns, is that people are pretty good at spotting deep fakes. They aren't quite at a level where they can easily pass for the real thing. There's still something slightly off about them. They tend to
butt up against the uncanny valley. Now, for those of you not familiar with that term, the uncanny valley describes the feeling we humans get when we encounter a robot or a computer generated figure that closely resembles a human or human behavior, but you can still tell it's not actually a person, and it's not a good feeling. It tends to be described as repulsive and disturbing, or at the very best, off putting. See also the animated film
The Polar Express. There's a reason that when that film came out, people kind of reacted negatively to the animation, and it's also a reason why Pixar tends to prefer to go with stylized human characters who are different enough from the way real humans look to kind of bypass the uncanny valley. We just think of that as a cartoon, not something
that's trying to pass itself off as being human. But while there hasn't really been a flood of fake videos hitting the Internet with the intent to discredit politicians or infuriate specific people or whatever, there remains a general sense that this is coming. It's just not here now. The sense I get is that people feel it's an inevitability, and there are already folks working on tools that will help us sort out the real stuff from the fakes.
Take Microsoft, for example. Their R&D division, fittingly called Microsoft Research, developed a tool they call the Video Authenticator. This tool analyzes video samples and looks for signs of deep fakery. In a blog post written by Tom Burt and Eric Horvitz, two Microsoft executives, they say, quote, it works by detecting the blending boundary of the deep fake and subtle fading or gray scale elements that might not
be detectable by the human eye, end quote. Now I'm no expert, but to me, it sounds like the Video Authenticator is working in a way that's not too dissimilar to a discriminator in a generative adversarial network. I mean, the whole purpose of the discriminator is to discriminate, or to tell the difference between genuine, unaltered videos and computer generated ones. So the Video Authenticator is looking for telltale signs that a video was not produced through traditional
means but was computer generated. However, that's the very thing that the generators in GAN systems are looking out for. So when a generator receives feedback that a video it generated did not slip past the discriminator, it then tweaks those input weights and starts to shift its approach in order to bypass whatever it was that gave away its last attempt, and it does this again and again.
So the video authenticator might work well for a given amount of time, but I would suspect that in the long run, the deep fake systems will become sophisticated enough to fool the authenticator. Of course, Microsoft will continue to tweak the authenticator as well, and it will become something of a seesaw battle as one side outperforms the other temporarily,
and then the balance will shift. Though there may come a time where either the deep fakes are too good and they don't set off any alarms from the discriminator, or the discriminator gets so sensitive that it starts to flag real videos and it hits a lot of false positives and calls them generated videos instead. Either way, you reach a point where a tool like this no longer really serves a useful purpose, and the video authenticator will be obsolete. Now, this is something we see in artificial
intelligence all the time. If you remember the good old days of CAPTCHA, you know, the proving you're not a robot stuff, the thing we were told to do was typically type in a series of letters and numbers, and it wasn't that hard, at least not at first. That's because the text recognition algorithms of
the time weren't very good. They couldn't decipher mildly deformed text, because the shape of the text fell too far outside the parameters of what the system could recognize as a legitimate letter or number. You make the number a little, you know, deformed, and then suddenly the system's like, well, that doesn't look like a three to me, because it's
not in the shape of a three. But over time, people developed better text recognition programs that could recognize these shapes even if they weren't in a standard three orientation, and those systems began to defeat those simple early CAPTCHAs, which required CAPTCHA designers to make tougher versions, and eventually the machines got good enough that they can match or
even outperform humans. And at that point, those text based CAPTCHAs proved to be more challenging for people than for machines, which meant if you used them, you defeated the whole purpose in the first place. So while this escalation proved to be a challenge for security, it was a boon for artificial intelligence. And while I've focused almost exclusively on the imagery of video here, the same sort of stuff is going on with generated speech, including generated speech that
imitates specific voices. Like deep fake videos, this approach works best if you have a really big data set of recorded audio, so people like movie and TV stars, news reporters, politicians, and, um, you know, podcasters, are great targets for this stuff. There might be hundreds or, you know, in my case,
thousands of hours of recorded material to work from. Training a model to use the frequencies, timbre, intonation, pronunciation, pauses, and other mannerisms of speech can result in a system that can generate vocals that sound like the target, sometimes to a fairly convincing degree. And for a while, to peek behind the curtain here, we at Tech Stuff were working with a company that I'm not going to name, but they were going to do something like this as
an experiment. I was gonna do a whole episode on it, and I had planned on crafting a segment of that episode only through text. I was not going to actually record it myself. Instead, we'd use a system that was trained on my voice to replicate my voice and deliver
that segment on its own. I was curious if it could nail not just the audio quality of my voice, which, let's be honest, is amazing (that's sarcasm, I can't stand listening to myself), but also how I actually make certain sounds. Like, would it get the bit of the Southern accent that's in my voice,
or the way I emphasize certain words? Would it pause for effect at all, or would it just robotically say one word after the next and only pause when there was some helpful punctuation that told it to do so? Would it indicate a question by raising the pitch at the end of its sentence? Sadly, we never got far with that particular project, so I don't have any
answers for you. I don't know how it would have turned out, but clearly one of the things I thought of was that it's a bit of a red flag. If you can train a computer to sound exactly like a specific person, that means you can make that person say anything you like, and obviously, like deep fake videos, that could have some pretty devastating consequences if it were
at all, you know, believable or seemed realistic. Now, the company we were working with was working hard to make sure that the only person to have access to a specific voice would be the owner of that voice, or presumably the company employing that person. Though that does bring up a whole bunch of other potential problems, like can you imagine eliminating voice actors from a job because you've got enough of their voice and you can just replicate it.
That wouldn't be great. But even so, it was something I felt was both fascinating from a technology standpoint and potentially problematic when it comes to the application of that technology. One other thing I should mention is that the Internet at large has been pretty active in fighting deep fakes, not necessarily in detecting them, but in removing the platforms from which they were being shared, Reddit being a big one. The subreddit that was dedicated to deep fakes had
been shut down. So there have been some of those moves as well. Now, this is not directly against the technology; it's more against the proliferation of the, uh, output of that technology. As for detecting deep fakes, it's interesting to me that people are even developing tools to detect them, because to me, the best tool so far seems to be human perception. It's not that the images aren't really convincing, or that we can suddenly detect these, you know, blending
lines like the Video Authenticator tool does. It's rather that it's just not hard for us to spot a deep fake right now. Stuff just doesn't quite look right in the way that people behave in these videos. The vocals and animation often don't quite match, the expressions aren't really natural, and the progression of mannerisms feels synthetic and not genuine. It just looks off. It's that uncanny valley thing, and so just paying attention and thinking critically can really help us suss
out the fakes from the real thing. Even if we reach a point where machines can create a convincing enough fake to pass for reality, we can still apply critical thinking, and we always should. Heck, we should be applying critical thinking even when there's no doubt as to the validity of the video, because there may be enough to doubt in the content of the video itself. If I listen to a genuine scam artist in a genuine video, that doesn't make the scam more legitimate. We always need to use
critical thinking. What I think is most important is that we acknowledge the very real fact that there are numerous organizations, agencies, governments, and other groups that are actively attempting to spread misinformation and disinformation. There are entire intelligence agencies dedicated to this endeavor, and then there are more independent groups that are doing it for one reason or another, typically either to advance a particular political agenda or just to make as much
money as quickly as possible. This is beyond doubt or question. There are numerous misinformation campaigns actively going on out there in the real world right now. Most of them are not depending on deep fakes, because one, deep fakes aren't really good enough to fool most people right now, and two, they don't need the deep fakes in the first place. There are other methods that are simpler, that don't need nearly the processing power, that work just fine.
Why would you go through the trouble of synthesizing a video if you can get a better response with a blog post filled with lies or half truths? It's just not a great return on investment. So, bottom line, be vigilant out there, particularly on social media. Be aware that there are plenty of people who will not hesitate to mislead others in order to get what they want. Use a critical eye to evaluate the information you encounter. Ask questions,
check sources, look for corroborating reports. It's a lot of work, but trust me, it's way better that we do our best to make sure the stuff we're depending on is actually dependable. It'll turn out better for us in the long run. Well, that wraps up this episode of Tech Stuff, which, yeah, I used as a backdoor to argue for critical thinking again. Sue me. Don't, don't really sue me. But I think that's another instance where there's a really clear example of where we have to use that kind
of stuff. So I'm gonna keep on stressing it. And you guys are awesome. I believe in you. I think that when we start using these tools at our disposal, which everybody can develop just with some practice, things will be better. We'll be able to suss out the nonsense from the real stuff, and we're all better off in the long run if we can do that. If you guys have suggestions for future topics I should cover in episodes of Tech Stuff, let me know via Twitter.
The handle is TechStuffHSW, and I'll talk to you again really soon. Tech Stuff is an I Heart Radio production. For more podcasts from I Heart Radio, visit the iHeartRadio app, Apple Podcasts, or wherever you listen to your favorite shows.