#496 Neil: AI Video Creation Method That Keeps Every Scene Consistent

00:00

Imagine directing an entire cinematic movie where your only camera is a keyboard. No set. No crew. Just you. Right. Just you. And a machine learning to dream on command. Welcome to this deep dive. I am so glad you are here with us today. Yeah, thanks for having me. I'm excited for this one. We are going to take our time with this topic. We're slowing things down just a bit to break apart a highly structured, repeatable workflow for AI video creation. Which is something

00:30

a lot of people desperately need right now. Absolutely. We're exploring how to combine two powerful systems, Google Flow and Claude AI. The goal here is to bypass that chaotic learning curve of AI video and actually build consistent, controllable scenes. I mean, it's a huge shift. We aren't just looking at software tools today. We are looking at a complete architectural process. First, we build the writer's room. Then we construct the visual blueprint. And finally, we direct the animation

00:56

piece by piece. To really understand this workflow, we first have to understand why beginners usually fail, because it comes down to choosing the wrong type of tool from the absolute start. Yeah, that is the great AI video divide. There's basically this steep learning curve that stems from fragmented tools. Right, you have individual models. Exactly, models like Kling or Runway. They give you incredibly deep control, but they require juggling multiple

01:22

platforms and multiple subscriptions. It gets overwhelming. It does. But then, there are all-in-one workspaces. Google Flow solves that juggling act by combining three specific models into one place. So what are those three? Well, you've got Nano Banana Pro for your images, you've got Veo 3.1 for your video generation, and then Gemini Omni. Which brings in multimodal editing. Yes. Which, just to define that simply, means combining text, images, and video to guide the

01:51

AI. Right, exactly. But the really interesting part is Claude AI's role in all this. Yeah, because Claude doesn't actually generate any art, right? No, not a single pixel. It sits alongside Google Flow, purely as a scene planner and a prompt writer. I kind of like to compare this setup to a traditional movie set. Oh, yeah. Yeah, so Google Flow is your director of photography, and Claude AI is your head writer. That's a perfect

02:12

way to look at it. But let me ask you this. If Google Flow is a true all-in-one tool, why rely on an external text model like Claude at all? Well, Google Flow is optimized for manipulating visual data. Claude is just vastly superior at maintaining narrative logic over long context windows. So Claude handles the logic, freeing Google Flow to focus purely on the visuals. Precisely. You let the writer write, and the camera film.

02:39

Makes sense. So now that we have our writer and our DP, we have to actually teach the writer how to talk to the DP. Yeah, and this is where people get stuck. I mean, I still wrestle with prompt drift myself where my text instructions just get messy over time. Oh, absolutely. Writing prompts from scratch every time just leads to inconsistent garbage. The AI loses the thread. Right. So the solution is creating a custom skill inside Claude AI. You literally call it the AI

03:05

video prompt writer. How does that actually work in the interface? You just go to customize, then skills, then create with Claude. And you paste in exact instructions to always generate three specific types of prompts. Three types. OK, what's the first one? Number one is the design sheet that covers your characters, your props, the overall style. Number two is the storyboard. That's your panel by panel camera angles. And number three is the scene prompt, which are the

03:29

direct instructions for Veo 3.1. That's so structured. The source actually uses this great example. A woman and her dog escaping Manhattan during a zombie outbreak. Yeah, a classic setup. You literally just type that one simple sentence and Claude generates all the structured technical prompts for you. It really does. It removes all the friction. But wait, why do we need three separate, highly specific prompts instead of just one master prompt describing the whole video?
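The three-prompt split described above can be pictured as a small data structure. This is only a minimal sketch: the `PromptPackage` class, the `build_prompt_requests` function, and the wording of each request are hypothetical illustrations, not the actual text of the Claude skill or any real API.

```python
from dataclasses import dataclass


@dataclass
class PromptPackage:
    """Hypothetical container for the three prompt types named in the episode."""
    design_sheet: str   # characters, props, overall style
    storyboard: str     # panel-by-panel camera angles
    scene_prompt: str   # direct instructions for the video model


def build_prompt_requests(idea: str, panels: int = 12) -> PromptPackage:
    """Turn a one-sentence idea into three separate, narrowly scoped requests.

    The request wording is an assumption made for illustration only.
    """
    return PromptPackage(
        design_sheet=f"Design sheet for: {idea}. Describe characters, props, and overall style.",
        storyboard=f"Storyboard for: {idea}. Describe {panels} panels with camera angles.",
        scene_prompt=f"Scene prompt for: {idea}. Give direct animation instructions.",
    )


package = build_prompt_requests(
    "A woman and her dog escaping Manhattan during a zombie outbreak"
)
print(package.storyboard)
```

Keeping the three requests as separate fields mirrors the point made in the conversation: each prompt carries one job, so no single block of text has to balance every variable at once.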

03:56

Because if you feed a model a massive block of text, it simply drops details. It hallucinates. It can't balance all those variables at once. Breaking it down prevents the AI from getting overwhelmed and mixing up complex instructions. Exactly. You have to segment the cognitive load. OK. So with Claude generating these text blueprints, we must translate those into a visual foundation before generating any video. Right. And this is where we move into Google Flow, specifically

04:24

using Nano Banana Pro for images. So we are building the design sheet next. Yes. You must definitively establish the world. What does the main character look like? The dog. The zombie. What are the clothing and props like? You're defining the color palette too. Everything. And here's a crucial tip. Start with low resolution. Oh, to save generation credits. Yeah, it saves credits and it generates way faster. You use that low-res draft to check for mistakes. If an element is missing, you don't

04:54

fix it in Google Flow. You go back to Claude. Right. You go back to Claude for text revision. And once the image is perfect, then you render it in full resolution. That is smart. So then we move to the storyboard. We use Claude to generate a 12-panel storyboard prompt. Like the convenience store, the zombie attack, the escape. Right. And we generate this in Google Flow using the design sheet as a visual reference to lock in that consistency. You have to attach it. It's

05:20

mandatory. So what actually happens under the hood if you get impatient and skip the design sheet step? The model just invents a completely new reality for every panel. The woman and the dog will look different in every single shot. Without it, the AI literally forgets what your characters look like between every shot. Yeah, it has zero object permanence without that visual anchor. Okay. So we have our storyboard. We have our character DNA. Now we finally step

05:46

onto the stage to make things move. The fun part. Switching over to Veo 3.1 for video generation in Google Flow. But before writing the scene prompt, the guide says you must upload two references. Yes. First, the design sheet for your world and character consistency, and second, the specific storyboard panel image. Which locks in your framing and composition. Exactly. Then, and only then, do you feed Veo 3.1 the scene prompt from Claude.
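The "references first, scene prompt second" order can be expressed as a simple pre-flight check. This is a sketch under stated assumptions: `SceneRequest`, `missing_references`, and the file names are hypothetical and only model a checklist you might keep yourself. Nothing here calls Google Flow.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SceneRequest:
    """Hypothetical record of what is attached before a video generation."""
    scene_prompt: str
    design_sheet_path: Optional[str] = None      # world and character consistency
    storyboard_panel_path: Optional[str] = None  # framing and composition


def missing_references(request: SceneRequest) -> List[str]:
    """Return the names of required references that are not attached yet."""
    missing = []
    if not request.design_sheet_path:
        missing.append("design sheet")
    if not request.storyboard_panel_path:
        missing.append("storyboard panel")
    return missing


draft = SceneRequest(scene_prompt="She walks through the dark store, looking nervous.")
ready = SceneRequest(
    scene_prompt=draft.scene_prompt,
    design_sheet_path="design_sheet.png",
    storyboard_panel_path="storyboard_row_1.png",
)
print(missing_references(draft))
print(missing_references(ready))
```

An empty list means both visual anchors are in place and the scene prompt can be submitted; a text-only request is flagged before any credits are spent.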

06:13

So for the first four panels, it's walking through the dark store, looking nervous, and then a zombie jumps out. Right. Whoa. I just have to pause and think about this. Imagine scaling to a billion queries across the globe. But here we're just intimately tweaking one single perfect frame of a zombie attack. It's wild. It really is a staggering amount of compute power, just focused on the shadow of a zombie in an aisle. But it doesn't always come out perfectly on the first

06:42

try. No, definitely not. So we have to use this review and improve methodology. You do not throw away a whole clip if one cut looks wrong. Never. A weak first output is totally normal. You generate a second version with specific instructions. Like telling it to do a direct cut to a close-up. Exactly. And then you combine the strongest parts of both outputs in post. But why can't Veo 3.1 just recognize a bad cut and fix it automatically in a single generation? Because it doesn't understand

07:11

human anatomy or cinematic timing. It just predicts pixel patterns based on data. The AI lacks human taste. It needs us to stitch the best parts together. Right, you have to be the editor. So generating one good scene is great. But a 12-panel story will fall apart if you try to render it all at once. We have to control the AI's pacing. Yeah, you cannot do all 12 panels at once. This introduces the rule of chunking. Which means animating only four panels at a time. Right. Trying all 12 overwhelms

07:39

Veo 3.1. It causes random hallucinatory transitions. It's like stacking Lego blocks of data. You do it piece by piece so the whole thing doesn't topple. I love that analogy. So for each chunk, you need three references uploaded. Yes. Number one, the cropped storyboard row. Number two, the design sheet. And number three, the final frame of the previous clip. That last one seems like a real secret weapon. Oh, it absolutely is. What about fixing weak cuts? Like, if a character

08:08

shifts positions suddenly mid-scene. If that happens, instruct Veo to hold on a close-up before cutting to the next action. It hides the error. And again, you combine the best elements. I want to go back to that third reference for a second. Why is feeding the final frame of the previous clip back into the machine so critically important? Because it creates a rigid anchor in time. It prevents the model from subtly changing

08:30

the lighting or the camera distance. It forces the new scene to mathematically lock into the last clip's ending. Perfectly said. Down to the exact pixel. Even with chunking, the entire project can unravel if you forget the overarching philosophy of AI video. Visual consistency over everything else. Without a doubt. The most common reason videos fall apart isn't bad text prompts. It's losing visual consistency. The characters change. The environments shift. Yeah, it just looks amateur.
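The chunking rule and the per-chunk references discussed above can be sketched as a small planning script. The function names and file names (`storyboard_row_1.png`, `design_sheet.png`, `clip_1_final_frame.png`) are hypothetical placeholders. The script only plans which files to attach for each chunk and does not talk to any video tool.

```python
from typing import Dict, List


def chunk_panels(panels: List[int], size: int = 4) -> List[List[int]]:
    """Split the storyboard panels into consecutive groups of `size`."""
    return [panels[i:i + size] for i in range(0, len(panels), size)]


def plan_chunks(panel_count: int = 12, size: int = 4) -> List[Dict[str, list]]:
    """List the panels and reference images for each generation pass."""
    panels = list(range(1, panel_count + 1))
    plans = []
    for index, chunk in enumerate(chunk_panels(panels, size)):
        references = [
            f"storyboard_row_{index + 1}.png",  # cropped storyboard row
            "design_sheet.png",                 # constant visual anchor
        ]
        if index > 0:
            # The first chunk has no previous clip; later chunks anchor
            # to the final frame of the clip generated just before them.
            references.append(f"clip_{index}_final_frame.png")
        plans.append({"panels": chunk, "references": references})
    return plans


for plan in plan_chunks():
    print(plan["panels"], plan["references"])
```

With the default 12 panels, this yields three passes of four panels each. Only the second and third passes carry the third reference, since the opening chunk has no earlier clip to match.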

08:58

So rule number one is that the design sheet is the visual anchor. Keep it attached as a reference for every single generation. Let's talk about some mistakes that just burn time and credits. Number one, generating from text only. That forces the AI to guess entirely on its own. Number two, animating more than four panels at once. We covered that. It destroys the pacing. Number three, skipping the revision process or just expecting a perfect first output. You have to think like an editor.

09:26

It's an iterative process. And number four, moving forward with an incomplete design sheet. Right, because any error there propagates into every single scene that follows. I do want to push back gently on that first mistake, though. Since text is how we naturally talk to AI, why is generating from text only considered such a massive mistake? Because human language is just too imprecise. If you say dark store, that means a billion different pixel variations to the model. Words are too

09:53

ambiguous for video. Visual references provide the only undeniable truth. Exactly. You have to show it, not just tell it. So to recap this whole structured journey, AI video isn't luck. It's a process. It really is. You build the design sheet, you map the storyboard, you use Claude for precise prompts, you generate in small four-panel chunks with Veo, and always, always keep your visual references attached. If you're going to try this for the first time, my advice is

10:23

to keep it incredibly simple. One character, one location, one action scene. Master the workflow before building a complex epic. Walk before you run, for sure. It's just wild to think about where this is heading. If these dual AI tools can synthesize a cohesive, terrifying zombie escape from just a few structured constraints, what happens when this workflow gets fully

10:45

automated? Yeah, that's the big question. If the AI eventually learns to manage its own visual consistency between shots, what exactly becomes the human director's role in the filmmaking of the future? Yeah, it's something to think about. Thank you so much for joining us on this deep dive. Take care.

Transcript source: Provided by creator in RSS feed.
