Monologue: Don't Be Scared Of Sora

Speaker 1

00:03

Media. All right, Matt, I've read the YouTube comments, and this time I want it so you do not cut me off with the music too fast. Okay, good, right, all right, let's go. This is this week's Better Offline monologue, and I'm Ed Zitron. A lot of you have been saying you want me to do something about Sora, and if I'm honest, I haven't wanted to because I find

00:32

the whole thing is so utterly pathetic. A few weeks ago, OpenAI launched a half-baked social networking app attached to a compute-intensive video and audio generator, and people immediately began to do two things: freak out and generate as many copyright violations as humanly possible, all because OpenAI's original plan was to ask copyright holders to opt out of having their content presented in these videos. Sora spent several days covered in Nazi SpongeBobs and Pikachus

00:57

with guns before multiple Hollywood talent agencies, along with the estate of Martin Luther King Jr., intervened and complained, leading to OpenAI creating, to quote NPR, an opt-in policy allowing all artists, performers, and individuals the right to determine how and whether they can be simulated with OpenAI, blocking the generation of well-known characters on its public feed and offering to take down material not in compliance. It's unclear what happened with Nintendo, but I

01:21

imagine one of their seventy million lawyers attacked. And now we've got that out of the way, let's talk about Sora itself. I understand a lot of the people who listen are in film and TV, and they're kind of scared. And I understand that you've seen a few clips that look kind of sort of realistic, and that this, especially if you're in the creative arts, is quite terrifying because your mind naturally assumes that these clips can be strung together into some sort of coherent whole. This isn't the case.

01:45

Every single good, and I use the term loosely, Sora video is cherry-picked from many, many, many terrible generations. Every time you use Sora, it's random. It doesn't matter how specific your prompt is or however many times you've used it. Sora is effectively a giant video and audio slot machine. You can never ever guarantee that Sora will generate something useful, and as a result, can never really budget for using it. The human eye is remarkably demanding, and little visual inconsistencies

02:11

between scenes will make people feel weird and uncomfortable. Imagine that extrapolated to ten or fifteen seconds at a time, and how difficult it will be to get something that makes visual sense before you have to think about things like, does this connect to the rest of the footage I'm using? Okay, so the majority of actual professionals who would use Sora would not be using the app. They'll be connecting directly to the model on OpenAI's API.

02:33

It's just, it's not done via a classical app interface. Now, then there's the problem of cost. This is where you really need to start worrying if you're building things with Sora. So let's start off with the first problem: cost. So

02:48

OpenAI offers two different Sora models: Sora 2, which they say is designed for speed and flexibility and is ideal for the exploration phase, and that costs ten cents per second, and then there's Sora 2 Pro, which is either thirty cents or fifty cents a second depending on resolution, and, I quote, it's the thing you go to for production-quality outputs. So you're either spending one, three, or

03:10

five dollars for every ten seconds of footage. And like every generative model, the longer you generate, the higher the likelihood of hallucinations, which, in the case of Sora, means bizarre animations, inconsistent details, or just flat-out useless crap. Then there's the problem of time. OpenAI's own documentation says that a single render may take several minutes. At the end of those several minutes, out pops a video

03:31

that may or may not be of any use. OpenAI allows you to remix using more prompts, which allows some iterative development, but these remixes also cost money and also take several minutes. So let me walk you through a scenario. You're making a short film. Let's just say it's fifteen minutes long, which is nine hundred seconds. You ask Sora to generate a man putting on a hat.

03:52

Your first eight generations, each taking four minutes and five dollars apiece, which takes about thirty-two minutes and forty dollars, don't really do the job, so you do two more, taking another four minutes apiece and ten more dollars. You finally, on the next try, get something kind of useful, which costs you another five dollars, and then you realize you wanted him to wear a specific kind of hat.

04:13

This happens all the time when directing stuff. There are minor changes you make that you realize, when you're finally in the moment, would look or sound or be better. So, yeah, that doesn't go so well with probabilistic models. So shit, fuck, you gotta do something, so you remix: another four minutes, another five dollars. Fuck. Wrong hat. Four minutes, five dollars. Right hat, his hand blends through it for some reason. Okay, four minutes, five dollars. The hat's right, but when he

04:39

puts it on, his eye blinks. One of his eyes just blinks three times for some reason, so you can't really use that. Okay, four minutes, five dollars. Looks kind of good. Different hat again. Four minutes, five dollars. Hmm. You've now spent eighty dollars and over an hour generating a man trying to put on a hat. You're not

04:56

really much closer to having useful footage. And because, as you remix it again and again, it keeps making these little errors, because that's how these models go, it's impossible to tell whether the next generation will be the one that works or whether Sora will spit out some new little fuck-up. So the more intricate something is, the more expensive it gets. But you know what, you can find money places. You

05:18

can't find more goddamn time. I guess you could have a separate computer running more, but that's still gonna cost a bunch of money. How many of these slot machines are you gonna run at once? How many times are you going to allow them to edit? How can you have a coherent vision when you've got multiple people generating things? You can't. But you know what, perhaps perhaps the next generation will be great, or perhaps it will be dogshit. You have no way to know, because that's the magic

05:43

of generative AI. Yet these problems compound aggressively once you need any kind of visual consistency. The man now has to put the hat on and leave the house. How does the house look? Is the hat the same? Does he have wallpaper on his walls? Is there anyone else in the house? What kind of table? Two chairs, one chair, five chairs? How do you possibly keep all of these things consistent? You don't. You can't. That's part of what

06:06

makes Sora so goddamn awful. It's built specifically to make you scared of them, to create superficially impressive clips so that brain-dead Hollywood executives can claim it's the future. Yet in a practical sense, it's impossible to budget or plan or guarantee anything about what Sora might do. And this is pretty much across the board for these generative

06:26

models making video and audio. Now, I've heard from a few people that Sora is cheaper because it doesn't involve labor, which is something you could say only if you believed Sora would give consistent outputs. And really, the only thing that a probabilistic model like Sora can do is guarantee inconsistency,

06:43

even by Hollywood accounting standards. A generative tool that will cost hundreds or thousands of dollars to generate ten seconds of shitty footage that is impossible to coherently connect to more footage is a really terrible idea and also very inconsistent in its costs too. And like I said earlier, there's the issue of time. Every single entertainment product requires some sort of time budgeting, and it's impossible to say

07:06

how long it will take Sora to generate something. OpenAI doesn't even specify what several minutes means, meaning you can't really plan a production using it. Sora isn't cheaper, Sora isn't easier, and Sora certainly isn't more efficient. But you need to remember also that generative video models have been around for over a year and they're not really seeing mass use now. If this thing were capable of making anything truly useful, you'd see it everywhere right now.
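
The hat scenario from earlier is simple enough to tally in a few lines. This is a minimal sketch, not anything from OpenAI: the dictionary keys, function names, and step labels are made up for illustration, the flat four minutes per render is the scenario's own assumption (the documentation only says "several minutes"), and remixes are assumed to be billed like fresh ten-second generations, as they are in the scenario.

```python
# Prices per second of generated video, in cents, as quoted in the monologue.
# The dictionary keys are illustrative labels, not official model identifiers.
PRICE_CENTS_PER_SECOND = {
    "sora-2": 10,
    "sora-2-pro-lower-res": 30,
    "sora-2-pro-higher-res": 50,
}

CLIP_SECONDS = 10        # clip length used throughout the scenario
MINUTES_PER_RENDER = 4   # the scenario's assumption; docs only say "several minutes"


def clip_cost_dollars(tier, seconds=CLIP_SECONDS):
    """Cost in dollars of one generation (or remix) of the given length."""
    return PRICE_CENTS_PER_SECOND[tier] * seconds / 100


# The hat scenario in the order it is told: (what happened, number of renders).
HAT_SCENARIO = [
    ("first eight generations, none usable", 8),
    ("two more tries", 2),
    ("one kind-of-useful take, wrong kind of hat", 1),
    ("remixes chasing the right hat", 5),
]


def tally(scenario, tier="sora-2-pro-higher-res"):
    """Return (renders, dollars, minutes) for a list of (label, count) steps."""
    renders = sum(count for _, count in scenario)
    dollars = renders * clip_cost_dollars(tier)
    minutes = renders * MINUTES_PER_RENDER
    return renders, dollars, minutes


if __name__ == "__main__":
    for tier in PRICE_CENTS_PER_SECOND:
        print(tier, clip_cost_dollars(tier), "dollars per ten-second clip")
    renders, dollars, minutes = tally(HAT_SCENARIO)
    print(renders, "renders,", dollars, "dollars,", minutes, "minutes")
```

At the fifty-cent tier the tally comes to sixteen renders, eighty dollars, and sixty-four minutes, which matches the figures in the scenario, and that is with no usable shot at the end of it.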

07:33

But you are seeing a little bit of it, and I do want to address that. You probably saw Kalshi's ad and heard that it cost two thousand dollars to make and took only a few days, but I really encourage you to look at the actual commercial itself. It's completely incoherent nonsense, each shot completely disconnected, with weird glitches and animations in the crowds, and at one point towards the end, a woman is meant to say "OKC," but the "C" part does not map to her mouth.

07:56

It looks really bad, and the only way you could get away with something like this is having these quick-hit shots. And also, please go and view the comments about this. People just rip the fuck out of this thing. But nevertheless, it was made using Veo 3, Google's generative video model, and it apparently took three hundred to four hundred clips to get fifteen usable shots, stitched together using traditional editing tools. Now, the reason this cost

08:19

two grand is that it sucked. And the reason you're not seeing more advertisers do this is because it's impossible to make a coherent video out of this footage. I realize most commercials you see on TV may feel chaotic or kind of bland, but they're remarkably precise, and the generative shots used for the Kalshi commercial are chaotic and fail to convey any real meaning beyond a person yelling "Indiana" or "OKC." The only reason it cost so little was one guy put several days of prompting into

08:45

it, and the end result was shit. Kalshi didn't mind because this was a publicity move. Kalshi put out the commercial specifically so the media would write it up, and they succeeded because the media loves to feed on scary stories like "AI is going to replace human actors." Since the Kalshi ad, PJ Ace, who made it, has made a few others: a Popeyes wrap one where, again, go and look at the comments. I'm not linking to it, by the way. I don't want to send them any

09:08

fucking traffic. But the Popeyes one, people are just responding saying, this looks like shit, what is this? It's incoherent, it's inconsistent. But the funniest one I found was David Beckham's IM8 health supplement ad, which ends with a shot of the bottle of the product with a bunch of garbled generative text. It does not appear that PJ Ace has got a ton more work than this, probably because the outputs kind of suck and brands really do not like

09:32

inconsistent things. And also, a fucking health supplement from David Beckham. Jesus Christ, just say it's a private equity firm. Anyway, to conclude, I also want to be clear that the rates for these videos are heavily subsidized by big tech, just like every other generative AI product. Sora might cost thirty or fifty cents a second right now. Once the AI bubble bursts, these prices will either skyrocket or these models will cease to exist for public consumption.

09:58

The biggest clue I can give you is Google only allows you to generate four or five Veo 3 videos a day on their two hundred and fifty dollars a month Gemini Ultra plan. That suggests that Google's video costs are brutal and that OpenAI is burning money by the bucket to let you fuck around on the Sora app. I don't recommend you do that, but if you have, just know you're burning a hole in Clammy Sammy's pocket. I will add that you may worry about

10:20

these models getting better. While they might get more nuanced in their ability to generate video in five or ten second bursts, generating longer or consistent videos is inherently impossible due to the probabilistic nature of transformer-based models. In simple terms, these things are rolling the dice every time. The way you prompt them is what makes them generate, and they don't have minds or thoughts. They're just rolling the dice every time on whatever you

10:45

say and trying to interpret what you mean. Human beings, by the way, are extremely magical. I think you really underestimate how amazing people are. When we direct someone on a film set, even like an assistant director, that person keeps the production moving and makes sure everyone gets what they need and pushes back on a director when something might be impractical. A director is a visionary, but also an actor is someone that takes interpretation and then is

11:10

directed to do different things. But that direction is not a fucking prompt. Move your elbow, look, look at this way, look that way. The things that operate on a film or TV set are inherently different to just plugging words into a fucking model, and I get them. I get everyone in Hollywood who's scared right now. I get everyone in creatives, in creative arts even, who is scared right now. I feel for you. These people are losing. These people

11:37

are losing. This stuff does not work. It's inconsistent, it's incredibly expensive on subsidized rates, and in the end, I really, really believe that once the bubble pops, these things are going away. Thank you so much for listening. Reach out if you have any thoughts. I always love to hear from people. E Z at better offline dot com. I love getting your emails. I love getting your weird little missives on Reddit. I really am, I'm truly blessed, and I love you all. I love how many of

12:07

you listen. I love how communicative you are. It's been a big week with the Anthropic exclusive, and yeah, I'm gonna have already a Better Offline next week as well. Crap, I've gotta go do an episode. Shit, damn. Oh well, I have the best job in the world. Anyway, thank you for listening.

Transcript source: Provided by creator in RSS feed: download file
