Upsample What Matters: Region-Adaptive Latent Sampling for Accelerated Diffusion Transformers

Jul 24, 2025•19 min•Ep. 1000

Episode description

🤗 Upvotes: 27 | cs.CV, eess.IV

Authors:
Wongi Jeong, Kyungryeol Lee, Hoigi Seo, Se Young Chun

Title:
Upsample What Matters: Region-Adaptive Latent Sampling for Accelerated Diffusion Transformers

Arxiv:
http://arxiv.org/abs/2507.08422v1

Abstract:
Diffusion transformers have emerged as an alternative to U-Net-based diffusion models for high-fidelity image and video generation, offering superior scalability. However, their heavy computation remains a major obstacle to real-world deployment. Existing acceleration methods primarily exploit the temporal dimension, for example by reusing cached features across diffusion timesteps. Here, we propose Region-Adaptive Latent Upsampling (RALU), a training-free framework that accelerates inference along the spatial dimension. RALU performs mixed-resolution sampling across three stages: 1) low-resolution denoising latent diffusion to efficiently capture global semantic structure, 2) region-adaptive upsampling on specific regions prone to artifacts at full resolution, and 3) upsampling of all latents at full resolution for detail refinement. To stabilize generations across resolution transitions, we leverage noise-timestep rescheduling to adapt the noise level across varying resolutions. Our method significantly reduces computation while preserving image quality, achieving up to a 7.0× speed-up on FLUX and 3.0× on Stable Diffusion 3 with minimal degradation. Furthermore, RALU is complementary to existing temporal accelerations such as caching methods, and can thus be seamlessly integrated to further reduce inference latency without compromising generation quality.
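
To make the three-stage schedule concrete, here is a minimal Python sketch of how the number of latent tokens could change from stage to stage. It is an illustration under stated assumptions, not the authors' implementation: the per-cell `scores` grid, the `fraction` of cells selected, the 2× upsampling `factor`, and the function names are all hypothetical, and the noise-timestep rescheduling step is left out entirely.

```python
from typing import Dict, List, Set, Tuple

Grid = List[List[float]]


def upsample_nearest(grid: Grid, factor: int = 2) -> Grid:
    """Nearest-neighbour upsampling: each cell becomes a factor x factor block."""
    out: Grid = []
    for row in grid:
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def select_regions(scores: Grid, fraction: float) -> Set[Tuple[int, int]]:
    """Pick the top `fraction` of low-resolution cells by score.

    ASSUMPTION: `scores` stands in for whatever criterion marks a region as
    prone to artifacts; the abstract does not specify it.
    """
    cells = [
        (scores[r][c], r, c)
        for r in range(len(scores))
        for c in range(len(scores[0]))
    ]
    # Highest score first; ties broken by position for determinism.
    cells.sort(key=lambda t: (-t[0], t[1], t[2]))
    k = max(1, round(fraction * len(cells)))
    return {(r, c) for _, r, c in cells[:k]}


def token_counts(
    height: int, width: int, selected: Set[Tuple[int, int]], factor: int = 2
) -> Dict[str, int]:
    """Count latent tokens processed per stage of a mixed-resolution schedule."""
    low = height * width
    per_cell = factor * factor
    # Stage 2: selected cells are at full resolution, the rest stay low-res.
    mixed = (low - len(selected)) + len(selected) * per_cell
    full = low * per_cell
    return {"stage1_low": low, "stage2_mixed": mixed, "stage3_full": full}


# Toy 4x4 low-resolution grid with four high-scoring cells in the centre.
scores = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.8, 0.1],
    [0.1, 0.7, 0.6, 0.0],
    [0.0, 0.1, 0.2, 0.1],
]
selected = select_regions(scores, fraction=0.25)
counts = token_counts(4, 4, selected)
print(sorted(selected))
print(counts)
```

In this toy setting, stage 1 processes 16 tokens, stage 2 processes 28 (12 low-resolution cells plus 4 selected cells expanded to 4 tokens each), and stage 3 processes all 64. Since transformer cost grows with token count, spending early steps on the smaller grids is where a spatial speed-up of this kind would come from.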
