Packing Input Frame Context in Next-Frame Prediction Models for Video Generation

Apr 19, 2025•24 min•Ep. 692

Episode description

🤗 Upvotes: 24 | cs.CV

Authors:
Lvmin Zhang, Maneesh Agrawala

Title:
Packing Input Frame Context in Next-Frame Prediction Models for Video Generation

Arxiv:
http://arxiv.org/abs/2504.12626v1

Abstract:
We present a neural network structure, FramePack, to train next-frame (or next-frame-section) prediction models for video generation. FramePack compresses input frames so that the transformer context length stays a fixed number regardless of the video length. As a result, we are able to process a large number of frames using video diffusion with a computation bottleneck similar to that of image diffusion. This also makes the training video batch sizes significantly larger (batch sizes become comparable to those of image diffusion training). We also propose an anti-drifting sampling method that generates frames in inverted temporal order with early-established endpoints to avoid exposure bias (error accumulation over iterations). Finally, we show that existing video diffusion models can be finetuned with FramePack, and that their visual quality may improve because next-frame prediction supports more balanced diffusion schedulers with less extreme flow shift timesteps.
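
To give a feel for why compressing older frames can keep the context length from growing with the video, here is a minimal sketch. It is not the paper's implementation: the geometric compression schedule, the `rate` parameter, the `tokens_per_frame` value of 1536, and both function names are assumptions made purely for illustration. The abstract only states that the context length becomes fixed, not how each frame is compressed.

```python
def unpacked_context_length(num_frames: int, tokens_per_frame: int = 1536) -> int:
    """Context length when every input frame keeps its full token count."""
    return num_frames * tokens_per_frame


def packed_context_length(
    num_frames: int, tokens_per_frame: int = 1536, rate: int = 2
) -> int:
    """Context length under an ASSUMED geometric compression schedule.

    The i-th most recent frame (i = 0 is the newest) keeps
    tokens_per_frame // rate**i tokens. Frames whose budget reaches zero
    are dropped, which is a simplification made for this sketch.
    """
    total = 0
    for i in range(num_frames):
        tokens = tokens_per_frame // rate**i
        if tokens == 0:
            break  # every older frame would also contribute nothing
        total += tokens
    return total


if __name__ == "__main__":
    for n in (1, 2, 16, 1000):
        print(n, unpacked_context_length(n), packed_context_length(n))
```

Under these assumptions the unpacked length grows linearly with the number of frames, while the packed length is a truncated geometric series and stays below `tokens_per_frame * rate / (rate - 1)`, that is, under twice the per-frame token count when `rate` is 2, however long the video gets.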
