How AI Can Make You Look Like a Better Dancer

Speaker 1

00:04

Get in touch with technology with tech Stuff from how stuff works dot com. Hey there, and welcome to tech Stuff. I'm your host, Jonathan Strickland. I'm an executive producer and I love all things tech. And here's a fun fact about getting older. As you age, stuff that you once thought was impossible will become not only possible, but will become the norm, and future generations won't even think about what it must have been like before the impossible was commonplace.

00:37

Now this is true for every generation. It's not like this is, you know, a brand new, groundbreaking observation. Plenty of people have made it before me, but I want to talk about a specific implementation. For example, with photography, it used to be pretty difficult to manipulate pictures convincingly. There have been photo editing tools for decades, but generally it took a great deal of skill and training, plus access to specialized equipment to pull it off, especially in

01:10

the old film days. Now, gradually, tools like Photoshop made it easier to manipulate digital images. Now it still requires a certain level of skill to pull off a really convincing job, and it's easy to make really badly manipulated images. But as these tools became widely available, people began to learn how to use them. We had to come to the realization that we cannot necessarily believe our own eyes when we're looking at a digital image. Now, the same

01:40

thing is happening with video footage. It's quite possible to fake video footage, though again, if you want to do it really well, it requires some skill, some specialized tools, and some real expertise in order to do it in a way that's really convincing. But it's a pretty recent technological capability. In fiction, however, it's been around for a long time. I remember seeing

02:06

the movie Rising Sun back in the early nineteen nineties. Now, in that film, Wesley Snipes plays a detective and Sean Connery, in his most convincing role since he played an Egyptian immortal posing as a Spaniard with a Scottish accent in Highlander, would play a Japanese customs and culture expert.

02:28

Sean Connery. Anyway, the film is a mystery thriller, and while the two are investigating a murder, they come into possession of some video footage, and they find out that video footage has actually been manipulated. It was planted for them to find so that it would put them on the

02:46

wrong trail. One person's face was replaced with someone else's, and in a somewhat comedic scene, the video editor who is explaining this casually swaps the heads of Snipes and Connery in real time in a video feed to show off this capability, which might have been a tad unrealistic, but today we are in a world in which video manipulation of increasingly convincing quality is achievable in real time.

03:14

In fact, these days, it's possible to use sophisticated computer algorithms to allow for manipulation of captured video, almost as if the video was a computer generated cartoon reacting to real time inputs like a video game controller, only instead of it being a video game, it's a real person

03:32

on video. There are, of course, lots of ways this technology could be used unethically, and one of the best known has been the focus of this whole conversation around video manipulation, and it comes from a former Reddit user

03:48

who went by the handle deep fakes. Now that handle has become the shorthand for the general practice, which frequently, but not exclusively, would involve replacing the face of an actor in a pornographic scene with someone else's face, like that of a celebrity, which is pretty darn unethical and creepy. The name itself was a reference to the technology used in the approach, so it relies on a process called deep learning. Deep learning is a type of machine learning.

04:22

It's a subtype of machine learning that utilizes artificial neural networks. And I've talked an awful lot about those kinds of networks recently, so I'm not going to go over the whole thing again. I'll give just a really quick rundown to say, you have nodes, artificial neurons in these networks that receive input from potentially multiple other nodes, and then on that input, the artificial neuron that you're looking at will perform some sort of weighted operation and

04:52

produce a single output. That single output can move on to become one of many inputs for a different artificial neuron in that network, and so on and so forth. Deep learning networks are very, very large artificial neural networks, and they can accept a large amount of training data. This is a scalable approach. This means the larger the network and the more data you can feed it, the better it performs. This is different from many other machine

05:22

learning models. Those tend to hit a performance plateau once you hit a certain size, which means that if you were to add more nodes to the network, you wouldn't necessarily see a comparative increase in performance. You would kind of flatten out over time. In twenty fourteen, a deep learning expert named Andrew Ng gave a talk at Stanford about the best use cases for deep learning, and

05:48

he mentioned that it was particularly good at supervised learning tasks. Now, these are the types of computer problems in which we humans already know the answer, such as is there a cat in this photograph? Humans can pick up on that right away, assuming someone has not carefully hidden a cat in a very busy image. But for a computer, this

06:12

is a much more difficult problem. Even if the picture has a cat center stage, it can be tough for a computer to figure that out. Using a supervised learning approach with a deep learning network, you can train a system to recognize cats in images with a high degree of success if you have a large amount of training data to train the network to recognize cats. Now, in this other case that I was talking about, the Reddit user called deep fakes started posting on Reddit in late twenty seventeen.
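
To make that quick rundown a little more concrete, here is a minimal sketch in Python of a single artificial neuron learning a supervised task. Everything in it is an assumption made for illustration: the two made-up features, the labels, the learning rate, and the function names. A real cat detector would be a deep network trained on a large set of labeled photos, not one neuron with two inputs.

```python
import math

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs, squashed into a single output between 0 and 1."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))

def train(examples, epochs=2000, lr=0.5):
    """Supervised learning: nudge the weights toward the answers we already know."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for inputs, label in examples:
            error = neuron(inputs, weights, bias) - label
            weights = [w - lr * error * x for w, x in zip(weights, inputs)]
            bias -= lr * error
    return weights, bias

# Hypothetical features: (has_whiskers, has_wheels). Label 1 means "cat".
EXAMPLES = [((1.0, 0.0), 1), ((1.0, 0.0), 1), ((0.0, 1.0), 0), ((0.0, 0.0), 0)]

weights, bias = train(EXAMPLES)
cat_score = neuron((1.0, 0.0), weights, bias)
car_score = neuron((0.0, 1.0), weights, bias)
print(f"cat-like input: {cat_score:.2f}, car-like input: {car_score:.2f}")
```

The output of `neuron` is the "single output" from the rundown; in a deep network it would become one of the inputs to neurons in the next layer.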

06:44

The user made an open source code version of a deep learning algorithm and made it available for the purposes of video manipulation, and anyone could take advantage of it. Now, specifically, this algorithm was designed for face swapping. The algorithm would allow you to put the face of one person onto the body of another in video form, and it wasn't always convincing. In fact, it could often be easily detectable as fake if someone had not trained the model properly

07:15

before creating the video. But it did open up a can of worms once the practice started getting media coverage. However, the actual technology to pull this off was already a

07:25

couple of years old when deep fakes shared it. There was a group of researchers from Stanford, and also the University of Erlangen-Nuremberg and the Max Planck Institute for Informatics, who collectively published a paper that was titled Face2Face: Real-time Face Capture and Reenactment of RGB Videos, and that's face, the number two, and face. The paper details the methodology the

07:54

group used to create a pretty incredible effect. The algorithm could take the facial expressions from one person and transfer them in real time to a video target. It was like turning the video into a digital puppet. So you might have a video loop of a celebrity running, and preferably it's a loop that's easily repeatable without the repeat

08:17

being terribly noticeable, and that's your target video. So if you just let it run, you just would see video of someone sitting down, maybe looking around a little bit, but that's it, nothing special. Then you would have a source subject sitting in view of a consumer quality webcam, no special equipment here, and that person could make different expressions, including opening and closing their mouths, and the video target

08:44

would match them move for move, like a digital puppet. Moreover, the source subject didn't have to wear any special gear. They didn't have to have any special markers, none of those dots that you would see with motion capture. None of that was necessary. All the algorithm needed was a video feed from a monocular camera, so you didn't even

09:03

need depth perception for this. There's a video of their work on YouTube that shows off this process and includes a loop of George W. Bush sitting for an interview. The source subject can manipulate the face that Bush makes just by making faces of his own, and the algorithm would map those movements to the target video. And it's pretty wild to see a moving image of George W. Bush responding in real time to all of these different facial expressions this guy is making. So how

09:35

did the team do this? It's one thing to say a deep learning algorithm gave them this capability, but that's not really an explanation. The paper definitively spells this out in real technical detail. It starts off the explanation by saying, quote, in our method, we first reconstruct the shape identity of the target actor using a new global non-rigid model-based bundling approach based on a prerecorded training sequence.

10:04

As this preprocess is performed globally on a set of training frames, we can resolve geometric ambiguities common to monocular reconstruction. At runtime, we track both the expressions of the source and target actor's video by a dense analysis-by-synthesis approach based on a statistical facial prior, end quote. And it goes on in that vein throughout the paper, which means it gets pretty dense. But I think we can suss out what's going on from a high level

10:32

if we just take a moment. But first, I'm going to take a moment of my own to thank my sponsor. So how did the Face2Face team build this tool? Well, for each target video, they would collect a large sample of footage and images and feed it to this deep learning algorithm. This would be necessary to identify all the points on the face that would move with various expressions, as well as to capture images of the inside of

11:08

the target's mouth when he or she spoke. This is because the video loop they used to create the manipulated video would feature the target subject, typically with his or her mouth closed, so it might be a section in which the subject was sitting down for an interview and listening to an interviewer's questions but not responding yet they

11:27

were just listening. The additional video would provide information about the inside of the target subject's mouth, which could be rendered in time on the target video performance when it came time to do that. Their approach improved the scanning technique to build face templates for both the source subject who provides all the expressions and the target subject, who

11:48

mimics all the expressions. As the source subject makes different facial expressions, the computer face template detects how the subject's face changes or deforms over time. The computer model then takes that information, saying, all right, well, the lips moved in this way, there was a grimace here or a smile there, and transfers those motions to the target's face template,

12:15

which is matched to the target's actual face. This process transfers the expressions over to the target, and so when the source subject grimaces, the target grimaces. If the source subject just sits still, the target video will continue to loop and the target's face won't change. The more video footage you can get of your target and your source, the better the computer algorithms are that create those face templates, and the more natural the manipulation will appear on the

12:44

finished video. You also want a really good amount of footage just to get all that extra information you need, like the inside of the mouth, so that that can all be extrapolated properly. You have to design a tool that can encode an image, called the training image, and then decode this data to reconstruct the image. So imagine you've got a picture. The encoder essentially creates data based on that image. It's like a description of that image.

13:15

The decoder takes the description and tries to rebuild the image based on the description. I think of this like that scene in Willy Wonka where Mike Teavee gets broken up into a million little pieces and then gets reconstructed on the television screen. So the second image is not a copy. It's not like you made a copy of the first one. It's like you built a new image based on the first one. By the way, when you start off with these algorithms, those reconstructions tend

13:45

to look pretty bad. You have to continually train and train and train and train the model so that it gets better and better at producing a close representation of the original image when it does its reconstruction. And you would essentially have decoders for both your source subject and your target subject. Use the same encoder for both, but two different decoders, one dedicated to your source, one dedicated to your target. Then you would feed the reconstructed

14:20

images through the system again and again. This is called back propagation. You do this millions of times, typically, to improve this process, and then you're ready to really switch things up. So let's say we've got two people. We've got person one and we've got person two, and you've been feeding images of both of these people through the same encoder, but of course you have dedicated decoders to produce the reconstruction. So person one has decoder

14:47

one and person two has decoder two. Now let's say you're ready to put person two's face on person one's body. Well, you would feed an image of person one into the encoder, but you use the decoder for person two to reconstruct the image, and what you get is person two's

15:08

face but mimicking the expression from person one. You, or rather the computer algorithm, does this frame by frame on video, and you end up with a video appearing to feature one person when in fact it's just their face on top of someone else, and it's their face making the exact same expressions as whoever was originally in that video.
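
The shared encoder and the two dedicated decoders described above can be sketched in a few lines of Python with NumPy. This is a toy under loud assumptions, not the deep fakes code: the "images" are random eight-number vectors, the encoder and decoders are plain linear maps rather than deep networks, and the gradients are written out by hand (that is the step a deep learning library automates with backpropagation). All names, sizes, and the learning rate are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, CODE = 8, 3  # toy "image" size and size of the encoded description

def make_faces(n):
    """Stand-in for one person's training images: low-rank vectors plus a little noise."""
    basis = rng.normal(size=(CODE, DIM)) / np.sqrt(DIM)
    return rng.normal(size=(n, CODE)) @ basis + 0.01 * rng.normal(size=(n, DIM))

faces_1, faces_2 = make_faces(64), make_faces(64)

# One shared encoder, one decoder per person.
encoder = 0.1 * rng.normal(size=(DIM, CODE))
decoders = {1: 0.1 * rng.normal(size=(CODE, DIM)),
            2: 0.1 * rng.normal(size=(CODE, DIM))}

def reconstruction_loss(faces, person):
    rebuilt = faces @ encoder @ decoders[person]
    return float(np.mean((rebuilt - faces) ** 2))

def total_loss():
    return reconstruction_loss(faces_1, 1) + reconstruction_loss(faces_2, 2)

loss_before = total_loss()
LEARNING_RATE = 0.1
for _ in range(2000):
    encoder_grad = np.zeros_like(encoder)
    for person, faces in ((1, faces_1), (2, faces_2)):
        code = faces @ encoder                       # the "description" of each image
        residual = code @ decoders[person] - faces   # how far off the rebuilt image is
        scale = 2.0 / residual.size
        encoder_grad += scale * faces.T @ (residual @ decoders[person].T)
        decoders[person] -= LEARNING_RATE * scale * code.T @ residual
    encoder -= LEARNING_RATE * encoder_grad          # both people train the same encoder
loss_after = total_loss()

def swap_face(frame_of_person_1):
    """Encode a frame of person one, then rebuild it with person two's decoder."""
    return frame_of_person_1 @ encoder @ decoders[2]

# Frame by frame, as with video.
swapped_frames = np.array([swap_face(frame) for frame in faces_1[:5]])
print(f"reconstruction loss: {loss_before:.3f} -> {loss_after:.3f}")
```

The only part that carries over directly to the real thing is the wiring: every image goes through the same encoder, each person gets a dedicated decoder, and the swap is nothing more than pairing person one's encoding with person two's decoder. With random vectors the swapped output is not meaningful to look at.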

15:32

Now back over to deep fakes. Before long, after the Reddit user initially posted this code, folks over at Reddit were taking this open source code and making more advanced software based off of it. Soon there were desktop apps that would take over all the hard parts of this process, all the codey bits, if you will, of training a model. Some of them would guide users into creating the data that would be used to train the model and go all the way through the process of creating the final

16:02

fake videos. Even with some of the more sophisticated versions, there were telltale signs of tampering. Typically there was some blurring around images, particularly near chins and mouths. Those would be signs. If there was any flicker, that was a sign that you didn't take enough time to train the model. Typically you would want to do several days of training at least. If you didn't take that time, you might see some really nasty blurring and flickering, and it would be a

16:29

dead giveaway that this was tampered video. Writer, director, and comedian Jordan Peele demonstrated the power of this technology. He showed how, with his impersonation of Barack Obama and some manipulation software, he could create a fake public service address in which the president would appear to say things

16:52

that he normally would never say. The technology behind this made use of what is called a long short term memory network, or LSTM. To go into the mechanics of that would require another podcast, but using an approach similar to what I've already described, a team was able to make a video of Obama apparently lip syncing Peele's satirical message. The goal of this PSA was: be on alert, because fakes are getting harder to spot.

17:20

The University of Washington showed off this in their Synthesizing Obama project, in which they took the audio from one of President Obama's speeches and then used it to animate his face in video from a different address that he gave during his presidency. So in this example, the person in the target video is the same person as the source for the audio. But the point was pretty clear that tech would soon make it possible to fake someone

17:51

saying or doing something. It just takes the right algorithms, the right amount of training data, and the right amount of time to get the model trained up enough to do it smoothly. Now, this technology could be used to do stuff that isn't related to malicious deception or for pornography or anything along those lines. It could be used in television and film for lots of stuff, including potentially adding in actors who have passed away into a film.

18:20

Paired with similar work that's going on in voice synthesis, you could end up with a convincing replacement, which means we could make movies with dead actors taking on new parts because we can synthesize their speech, we can synthesize their appearance. You would still have someone else acting out the part physically, but you would replace their image with this actor's image. Or maybe you would want to use this kind of technology just to make everyone think you

18:49

can cut a rug. This brings me to the University of California, Berkeley, and the subject of a paper titled Everybody Dance Now. The goal is a simple concept that's actually really hard to pull off. What if you were to take the movements of a professional dancer and then map those movements onto the body of someone who wasn't a dancer? What if you could create a video in which literally anyone would appear to move like a skilled,

19:17

trained dancer? And how the heck would that be possible? Well, at the heart of the team's efforts was something I talked about in a recent episode of tech Stuff about an AI generated portrait, and that would be generative adversarial networks, or GANs. These use a pair of artificial neural networks in competition against each other. So since I covered this recently, I'll just give again a

19:42

super quick high level summary. You've got one network that has a specific job, such as trying to create an original image of a cat. We'll go back to the cat pictures. That's one of my favorite ones because it was one of the early use cases of neural networks that I remember encountering when I was doing research. Now,

20:01

let's say you've got your second network. Your second network has the specific job of evaluating pictures of cats to determine if they are valid, meaning is this a real picture of a cat that's part of the training material that I'm accepting, or is this, in fact a fake that was created by a computer program the other neural network. So you've got one network trying to fool the other network.
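
As a rough illustration of one network trying to fool the other, here is a minimal sketch in Python with NumPy. It is a deliberately tiny stand-in, and every detail is an assumption: the "pictures" are single numbers, the generator is a two-parameter line, the detector is a single logistic unit, and the gradients are written by hand. It shows the alternating back-and-forth, and makes no claim about how well such a toy converges.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def train_gan(steps=500, batch=64, lr=0.05, seed=0):
    """Alternate detector and generator updates on one-dimensional 'pictures'."""
    rng = np.random.default_rng(seed)
    gen_scale, gen_shift = 1.0, 0.0   # generator: fake = gen_scale * noise + gen_shift
    disc_w, disc_b = 0.0, 0.0         # detector: P(real) = sigmoid(disc_w * x + disc_b)
    for _ in range(steps):
        real = rng.normal(4.0, 0.5, size=batch)   # stand-in for real cat pictures
        noise = rng.normal(size=batch)
        fake = gen_scale * noise + gen_shift

        # Detector step: raise its score on real samples, lower it on fakes.
        p_real = sigmoid(disc_w * real + disc_b)
        p_fake = sigmoid(disc_w * fake + disc_b)
        disc_w -= lr * (np.mean(-(1 - p_real) * real) + np.mean(p_fake * fake))
        disc_b -= lr * (np.mean(-(1 - p_real)) + np.mean(p_fake))

        # Generator step: shift the fakes so the updated detector scores them higher.
        p_fake = sigmoid(disc_w * fake + disc_b)
        grad_fake = -(1 - p_fake) * disc_w        # slope of -log(score) with respect to each fake
        gen_scale -= lr * np.mean(grad_fake * noise)
        gen_shift -= lr * np.mean(grad_fake)
    return gen_scale, gen_shift, disc_w, disc_b

gen_scale, gen_shift, disc_w, disc_b = train_gan()
print(f"fakes are now centered near {gen_shift:.2f}; real data is centered on 4.0")
```

Each pass through the loop is one round of the contest: the detector gets a little better at telling the two piles apart, and then the counterfeiter uses the detector's own judgment to adjust its output.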

20:28

And these networks get better at what they do over time. They improve. So your counterfeit network is getting better and better at making fake pictures of cats, and your detector network is getting better and better at detecting fake images of cats. Now, typically this requires humans to give feedback or tweak weight values along the networks, but they do

20:52

get better over time. So if the network trying to create a picture of a cat gets the feedback of sorry, buddy, but they're onto you, then it can try again and adjust its approach slightly in an effort to fool the second network. If the second network gets the feedback you let this one slip by and it's fake, then it will adjust, or it will be adjusted, to look out for any telltale signs that it had missed in that

21:16

earlier evaluation. Over time, the two networks working against each other will create the ultimate result of better and better computer generated content, whether it's an image of a cat or a sonnet, or a song or a video. Now that doesn't mean that these computer generated things are at the same level as human generated stuff, especially when it comes to text. I've seen a lot of song lyrics

21:46

that were inscrutable even by my old man standards. So I think that we're a long way away from getting to a point where they can fool us in every case. But with video they're getting pretty darn good. Now, this team had two groups of subjects, and so you had your source subjects and your target subjects. The source in this case, were the people who could dance, so like ballet dancers, hip hop dancers and that sort of stuff. They legit know how to move. They would demonstrate various

22:19

dances on video. The second group of subjects were your target subjects. These were not trained dancers. They were to go through a series of moves and poses, essentially aping as best they could the movements of trained dancers, and the goal of this pair of networks was to smooth the movements out and adjust the timing so that these untrained dancers would appear to move more like their groovy

22:46

source subject counterparts. I'll explain more in just a moment, but first let's take another quick break to thank our sponsor. According to the Everybody Dance Now paper, the team would transfer motion between the sources to the target through an end to end pixel based pipeline. So here's how that's done. Because if you're like me, that phrase meant next to nothing to you. So specifically, the group used three stages to take the movements of one person and transpose them

23:26

to a target person. Those three were pose detection, global pose normalization, and mapping from normalized pose stick figures to the target subject. Pose detection involves teaching machines, in other words, computers, how to interpret images to determine where key body points are, like elbows, knees, hips, shoulders, the head, that kind of stuff. That first requires that you teach the

23:53

machine to recognize those points in the first place. So first you have to train a machine to recognize those points and identify them with a target level of accuracy. It's pretty typical to represent these joints as points in a stick figure, so each point represents another joint or point of articulation. The lines represent the trunk of the body, the limbs, the head. You end up with

24:18

a stick figure. If your machine learning mechanism was a good one, the machine should be able to overlay a stick figure on top of any image of a person posing, and the stick figure should more or less conform to

24:30

that image, including where the actual joints are. So if you have someone standing there in the classic Peter Pan pose, with their fists on their hips and their arms akimbo, then it should draw a stick figure that's essentially aping the same thing and be able to overlay it on top of the original image.
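
A stick figure like that is easy to picture as data. The sketch below, in Python, is one plausible way to hold it: a list of named joints with positions and a confidence score, plus a list of which joints get connected by lines. The joint names, the confidence threshold, and the hand-typed coordinates are all assumptions for illustration; a real pose detector would produce the keypoints, and this is not the format of any particular tool.

```python
from dataclasses import dataclass

@dataclass
class Keypoint:
    name: str
    x: float           # horizontal pixel position
    y: float           # vertical pixel position (grows downward, as in images)
    confidence: float  # how sure the detector is, from 0 to 1

# Pairs of joints that get connected by a line to draw the stick figure.
SKELETON = [
    ("left_shoulder", "right_shoulder"),
    ("left_shoulder", "left_elbow"), ("left_elbow", "left_wrist"),
    ("right_shoulder", "right_elbow"), ("right_elbow", "right_wrist"),
    ("left_shoulder", "left_hip"), ("right_shoulder", "right_hip"),
    ("left_hip", "right_hip"),
]

def stick_figure_lines(keypoints, min_confidence=0.5):
    """Return ((x1, y1), (x2, y2)) segments for every limb whose two joints were found."""
    found = {k.name: k for k in keypoints if k.confidence >= min_confidence}
    return [((found[a].x, found[a].y), (found[b].x, found[b].y))
            for a, b in SKELETON if a in found and b in found]

# Hand-written keypoints for the fists-on-hips pose, with one poorly detected hip.
pose = [
    Keypoint("left_shoulder", 80, 100, 0.9), Keypoint("right_shoulder", 160, 100, 0.9),
    Keypoint("left_elbow", 40, 150, 0.8), Keypoint("right_elbow", 200, 150, 0.8),
    Keypoint("left_wrist", 85, 195, 0.7), Keypoint("right_wrist", 155, 195, 0.7),
    Keypoint("left_hip", 90, 200, 0.9), Keypoint("right_hip", 150, 200, 0.2),
]
lines = stick_figure_lines(pose)
print(f"{len(lines)} limb segments to draw over the image")
```

Dropping low-confidence joints is a design choice made here so the overlay doesn't draw limbs to points the detector was only guessing at.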

24:49

Now these days this can be done in real time. So, for example, there's a team at Google Creative Lab that used a machine learning model called PoseNet and created a JavaScript version with TensorFlow, which is an open source software library often used for machine learning. And with this tool you can do real time pose estimation through a browser and a webcam. The application doesn't have any technology

25:12

related to identifying the person in the image. It's just quote unquote interested in what the person is doing, not who the person is. So you can actually run this on your own machine in a browser, and you can pose in front of a webcam and you'll see the little stick figure painted on top of your image on the computer, essentially. So every time you move, every time you bend a joint, you will see the stick figure doing the same thing, mapped on top of you.

25:38

The Berkeley team made use of a pre-trained pose detector, meaning they didn't build a new one, which helps save a lot of time and expense on their project. Now people come in all shapes and sizes. In the video the team released, they showed off subjects who included a woman who appeared to be of around average height and a man who appeared to be pretty darn tall. A motion transfer method that would only work between a subject and a target who are of similar shape and size would be

26:07

pretty limited. So the purpose of the global pose normalization stage is to account for all the differences between the source and the target subjects and the locations within the frame of the camera. Without this step, the motion transfer might appear ghoulish. We don't have all the same proportions, right, so a mismatch might mean a target's limbs would appear to bend in places that were clearly not natural joints.

26:35

All you need to do is see an arm bend where an arm isn't supposed to bend, and that's going to skeeve you out quite a bit. Makes an effective horror movie experience, but not one that would produce convincing motion transfer. Now, there are a lot of ways that the team could have gone about normalizing the poses, but

26:50

their choice seems particularly clever to me. They measured the heights and ankle positions of the various subjects and used linear mapping between the closest and farthest ankle positions in both videos to normalize the stick figure for the target subjects. The program would calculate the scale of the figure as well as the scale of motion from frame to frame.
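
Here is a small worked sketch, in Python, of what a linear mapping between ankle positions could look like. It is a simplified reading of the description above and should not be taken as the team's exact procedure. The measurement names (`far_ankle_y`, `near_ankle_y`, `far_height`, `near_height`) and all of the numbers are hypothetical, and the figure is scaled around its ankles and its horizontal center as a convenience.

```python
def linear_map(value, src_far, src_near, tgt_far, tgt_near):
    """Map a value from the source range onto the target range, keeping its relative position."""
    fraction = (value - src_far) / (src_near - src_far)
    return tgt_far + fraction * (tgt_near - tgt_far)

def normalize_pose(keypoints, source, target):
    """Move and scale one frame's (x, y) keypoints from the source's frame to the target's."""
    ankle_y = max(y for _, y in keypoints)   # lowest point in the image is taken as the ankles
    center_x = sum(x for x, _ in keypoints) / len(keypoints)
    # Where should the ankles sit in the target video?
    new_ankle_y = linear_map(ankle_y, source["far_ankle_y"], source["near_ankle_y"],
                             target["far_ankle_y"], target["near_ankle_y"])
    # How tall does each person appear at that distance from the camera?
    source_height = linear_map(ankle_y, source["far_ankle_y"], source["near_ankle_y"],
                               source["far_height"], source["near_height"])
    target_height = linear_map(new_ankle_y, target["far_ankle_y"], target["near_ankle_y"],
                               target["far_height"], target["near_height"])
    scale = target_height / source_height
    return [(center_x + (x - center_x) * scale, new_ankle_y + (y - ankle_y) * scale)
            for x, y in keypoints]

# Hypothetical measurements, in pixels, taken from each video.
dancer = {"far_ankle_y": 300, "near_ankle_y": 400, "far_height": 100, "near_height": 200}
subject = {"far_ankle_y": 400, "near_ankle_y": 500, "far_height": 150, "near_height": 300}

head_and_ankle = [(100, 250), (100, 400)]
normalized = normalize_pose(head_and_ankle, dancer, subject)
print(normalized)
```

In this example the dancer is standing at the nearest point in their video, so the stick figure is moved to the nearest point in the subject's video and stretched by the ratio of the two heights there, which is 1.5.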

27:14

And I think that's pretty darn cool because it wasn't just accounting for the size of the subjects to get all the joints right, but also to make sure the scale of the movements with respect to the body size and proportions would remain the same. So a tall person with really long limbs moving their arms in really big, big, bold gestures, if you tried to transfer that motion to someone who was of smaller stature, it could really look disturbing.

27:43

But by using this scaling approach, the movements on the smaller person would be proportionate in size to the movements of the larger person. The team would use two of the Generative Adversarial Network setups to work on making a

28:00

convincing final video. The first was dedicated to image to image translation, attempting to manipulate the image of the target subjects so that it would follow the motions made from the pose detection process, and like all GAN setups, this included the generator, which would attempt to create a convincing sequence of images, and the discriminator, which tried to weed out the quote unquote fake sequences from the generator from the ground truth data that was being fed to it.

28:28

The second GAN setup was specifically dedicated to adding detail and realism to the faces of the target subjects. In some frames this appears to have worked pretty well, and in others there's a bit of an uncanny valley thing or maybe even horror movie type element going on, similar to how some of the AI generated portraits that I talked about in the previous episode introduced a bit

28:52

of unrealistic qualities to the various images. When shooting video of the target subjects, the team captured images at one hundred twenty frames per second to get enough data for each subject. The sessions lasted for about twenty minutes. They used smartphone cameras to do it, since many smartphones allow you to shoot video at this kind of frame rate

29:12

these days. They had their target subjects wear close-fitting clothing that wasn't prone to wrinkling because the pose recognition tool they were using wasn't designed to encode information about clothing. As for the source videos, the ones that would actually create the motions that would be transferred to the targets, the team didn't have to worry about capturing images at

29:32

such a high frame rate. They could use videos of just reasonable quality, meaning decent resolution and frame rate, and their pose detection tool would do its work and create the stick figure that would serve as the guide for the target motions later on. Because of that, the team can really use any online video of sufficient quality to act as the source information for motion transfer. It doesn't have to be a video shot specifically for that purpose.

29:59

In fact, one of the example videos the team used in their demonstration was from a Bruno Mars music video for That's What I Like. Before applying the motion transfer, the team smoothed pose key points to reduce jitter in the final output, and then the team applied the motion transfer. The stick figure motions were then transferred to the target subjects, and the result is pretty interesting. It is not seamless.
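
Smoothing key points to cut down on jitter can be as simple as averaging each joint's position with its neighbors in time. The Python sketch below uses a centered moving average. That particular filter, the window size, and the function name are assumptions for illustration, since the episode doesn't say which smoothing method the team used.

```python
def smooth_keypoints(frames, window=5):
    """Centered moving average over time for each keypoint coordinate.

    frames: list of frames, each a list of (x, y) tuples in a fixed joint order.
    """
    half = window // 2
    smoothed = []
    for i in range(len(frames)):
        neighbors = frames[max(0, i - half): i + half + 1]  # fewer neighbors at the ends
        frame = []
        for joint in range(len(frames[i])):
            xs = [f[joint][0] for f in neighbors]
            ys = [f[joint][1] for f in neighbors]
            frame.append((sum(xs) / len(xs), sum(ys) / len(ys)))
        smoothed.append(frame)
    return smoothed

# One jittery joint: it should sit at x = 10, but the detector wobbles by two pixels.
jittery = [[(10 + wobble, 50.0)] for wobble in (0, 2, -2, 2, -2, 0)]
steady = smooth_keypoints(jittery, window=3)
worst_before = max(abs(frame[0][0] - 10) for frame in jittery)
worst_after = max(abs(frame[0][0] - 10) for frame in steady)
print(f"worst wobble: {worst_before} pixels before, {worst_after} pixels after")
```

A wider window removes more jitter but also blurs fast, deliberate moves, which matters for dancing, so the window size is a trade-off rather than a fixed answer.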

30:26

You can definitely tell something odd is going on, but it is an indication of where things are going and using adversarial networks could lead to more convincing motion transfers in the future. Now, this could lead to all sorts

30:40

of stuff, nefarious and otherwise. You could imagine using it to transform an average actor into a martial arts master, or it might allow directors more freedom of casting, knowing that if the actors they choose don't possess certain physical skills, they can use this kind of technology to fake it. But it could also be used to fake footage to make it look like people, like specific people, are doing stuff

31:06

that they are not doing. It could be used to spread misinformation and it likely will be, which means we'll need to be on the lookout for signs of fakes, which are going to get harder and harder to detect as time goes on. And hey, you guys remember DARPA, right? Because I just did a whole series of episodes about them. Well, that agency has funded programs dedicated to automating various forensic tools, including tools that could be used to detect AI created forgeries in video and audio. Often

31:39

the secret is in the eyes. Most of these neural networks are trained on still images, so you send thousands or tens of thousands of images, if you have them, of your various subjects, your target and your source. But most published still images don't show people with their eyes closed. So eye movements and blinking tend to be a little wonky in these fake videos. You might watch one for a while and think, huh, that's weird. This guy hasn't blinked for like ten minutes, or when they

32:11

blink it looks really strange. Well, that's an indication that it's a fake video. There are other ones as well, but DARPA is understandably keeping those quiet because, you know, if they publish how they figure out AI created videos are in fact faked, then that gives the fakers enough information to go back and improve their models. So we're likely to see something akin to what happened with CAPTCHA. Specialists will develop new tools to detect AI generated media.
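
The blinking clue mentioned above lends itself to a simple check. The Python sketch below counts blinks in a per-frame "eye openness" signal and turns that into blinks per minute. It is a hypothetical illustration only: it assumes some other tool has already measured how open the eyes are in each frame (0 for shut, 1 for wide open), the threshold is arbitrary, and it is not a description of any forensic tool DARPA has funded.

```python
def count_blinks(eye_openness, closed_below=0.2):
    """Count open-to-closed transitions in a per-frame eye-openness signal."""
    blinks, was_closed = 0, False
    for value in eye_openness:
        is_closed = value < closed_below
        if is_closed and not was_closed:
            blinks += 1
        was_closed = is_closed
    return blinks

def blinks_per_minute(eye_openness, fps=30.0):
    minutes = len(eye_openness) / fps / 60.0
    return count_blinks(eye_openness) / minutes if minutes else 0.0

# One minute of made-up footage at 30 frames per second.
natural = [1.0] * 1800
for start in range(100, 1800, 150):       # a three-frame blink every five seconds
    natural[start:start + 3] = [0.05] * 3
suspicious = [1.0] * 1800                 # the eyes never close

print(blinks_per_minute(natural), blinks_per_minute(suspicious))
```

A clip whose blink rate comes out at zero, or far from what the same person shows in footage known to be real, would be a reason to look closer and not proof of a fake.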

32:45

AI developers will then create more sophisticated models, and so it becomes kind of an arms race, a seesaw, and one benefit is that AI as a whole will improve, but we may not be able to believe it when we see it. Well, that wraps up this episode on a fascinating, somewhat disturbing topic, and I'm sure we're gonna hear a lot more about this in the years to come. We've seen a lot of sites banning deep fakes

33:14

outright because of the misinformation that they can spread. So we're already seeing a reaction to this in various online communities, so that's very interesting to me. But we're definitely gonna keep seeing this continue. It's a valid area of AI research, so we will have to wait and see how it all plays out. If you guys have any suggestions for future episodes of tech Stuff, why not send me a message? You can go over to our website, that is tech Stuff podcast dot com. You'll find

33:47

all the different ways to contact me. I look forward to hearing from you. Make sure you check out our store over at tee public dot com slash tech Stuff. Buy some merchandise. You can make sure that you get all the really cool T shirts like Prove to Me You're Not a Robot. That one's pretty appropriate for this particular episode. And remember, every single purchase you make goes to help the show, so we greatly appreciate it. Also, if you haven't heard, we have been nominated for an

34:14

I Heart Radio Podcast Award. It's the first year I Heart Radio is giving out podcast awards. We are nominated in the Science and Technology category. You can go online and visit the I Heart Radio Podcast Awards page and vote up to five times a day for your favorite podcasts. If you wanted to. You could dedicate all five of those votes every single day to us. I would not complain if you did that. It would be really cool to win that award, but make sure you check it out.

34:43

There may be lots of shows there that you truly love and you want to throw your support behind them. That would be really cool, too. And I'll talk to you again really soon. For more on this and thousands of other topics, visit how stuff works dot com.

Transcript source: Provided by creator in RSS feed