Back to Basics: Let Denoising Generative Models Denoise

Best AI papers explained

Nov 23, 2025 • 15 min

Episode description

This paper introduces "Just image Transformers" (JiT), an approach to denoising diffusion models in which the network directly predicts the clean data (**x-prediction**) rather than the noise or a noised quantity. The authors motivate this choice with the **manifold assumption**: clean data lies on a low-dimensional manifold, while noise is inherently off-manifold. Experiments, including a toy model and high-resolution ImageNet generation with plain Vision Transformers (ViT), show that x-prediction handles high-dimensional spaces where conventional noise-predicting methods fail catastrophically. The work argues for a return to first principles: a self-contained **"Diffusion + Transformer"** paradigm on raw pixel data, without complex architectures, pre-training, or auxiliary losses. The paper also reports extensive ablation studies on loss combinations and architectural components, supporting the claim that **x-prediction** is more tractable for limited-capacity networks in high-dimensional generative modeling.
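To make the distinction between prediction targets concrete, here is a minimal NumPy sketch. It is not the paper's code. It assumes a linear noising schedule `z_t = t * x + (1 - t) * eps`, which the description above does not specify, and it uses a random linear subspace as a stand-in for the low-dimensional data manifold. All function names are hypothetical and chosen for illustration only.

```python
import numpy as np

def make_noisy(x, eps, t):
    # ASSUMPTION: linear schedule z_t = t * x + (1 - t) * eps, with 0 < t < 1.
    return t * x + (1.0 - t) * eps

def x_from_eps(z, eps_pred, t):
    # Recover clean data from a noise prediction under the assumed schedule.
    return (z - (1.0 - t) * eps_pred) / t

def x_from_v(z, v_pred, t):
    # Recover clean data from a velocity prediction, with v defined as x - eps.
    return z + (1.0 - t) * v_pred

rng = np.random.default_rng(0)
D, d = 512, 2  # ambient dimension and (toy) intrinsic data dimension

# Orthonormal basis of a random d-dimensional subspace of R^D (shape D x d).
P = np.linalg.qr(rng.standard_normal((D, d)))[0]

x = rng.standard_normal((16, d)) @ P.T  # clean data: lies on the subspace
eps = rng.standard_normal((16, D))      # Gaussian noise: fills all D dimensions
t = 0.3
z = make_noisy(x, eps, t)

def off_subspace_energy(a):
    # Fraction of squared norm that lies outside the data subspace.
    resid = a - (a @ P) @ P.T
    return float(np.sum(resid ** 2) / np.sum(a ** 2))

print("x target off-subspace fraction:  ", off_subspace_energy(x))
print("eps target off-subspace fraction:", off_subspace_energy(eps))
```

In this toy setting the x target has essentially no energy outside the subspace, while the noise target spreads almost all of its energy across the full ambient space. That is the intuition behind the claim that a limited-capacity network finds x-prediction easier. The conversion helpers show that the three parameterizations carry the same information under the assumed schedule, so the choice affects what the network must output rather than what can be recovered from it.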
