Revolutionizing Vision and Language Models: Depth Prediction Breakthroughs, Pixel-Level Transformers, and Robotic Skill Learning
Jun 20, 2024 • 13 min • Ep. 52
Episode description
Depth Anything V2
An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels
Transformers meet Neural Algorithmic Reasoners
Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling
OpenVLA: An Open-Source Vision-Language-Action Model
Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models
