AI Papers Podcast

A daily update on the latest AI research papers. We provide a high-level overview of a handful of papers each day and link all papers in the description for further reading. This podcast is created entirely with AI by PocketPod. Head over to https://pocketpod.app to learn more.

Last refreshed: December 3rd, 2025 at 6:03 AM

Episodes

BigCodeBench Challenges, Cambrian-1 Leap, D-MERIT's Evaluation, Long Context Breakthrough in Vision

DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation; BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions; Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs; Evaluating D-MERIT of Partial-annotation on Information Retrieval; Long Context Transfer from Language to Vision...

Jun 27, 2024•11 min•Ep. 55

LongRAG Breakthrough, LLMs as Judges, Transformer Memory Insights, Video Library AI, Democratizing Art Styles

LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs; Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges; Complexity of Symbolic Representation in Working Memory of Transformer Correlates with the Complexity of a Task; Towards Retrieval Augmented Generation over Large Video Libraries; Stylebreeder: Exploring and Democratizing Artistic Styles through Text-to-Image Models...

Jun 26, 2024•10 min•Ep. 54

Scaling In-Context Reinforcement Learning, ChartMimic's AI Benchmark, Multimodal Document Comprehension, Long Context Reasoning Challenges

XLand-100B: A Large-Scale Multi-Task Dataset for In-Context Reinforcement Learning; Make It Count: Text-to-Image Generation with an Accurate Number of Objects; ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation; Needle In A Multimodal Haystack; BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack...

Jun 21, 2024•11 min•Ep. 53

Revolutionizing Vision and Language Models: Depth Prediction Breakthroughs, Pixel-Level Transformers, and Robotic Skill Learning

Depth Anything V2; An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels; Transformers meet Neural Algorithmic Reasoners; Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling; OpenVLA: An Open-Source Vision-Language-Action Model; Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models...

Jun 20, 2024•13 min•Ep. 52

NaRCan Revolutionizes Video Editing, Training-Free Video Generation, Recaptioning Web Images with LLaMA-3, Novel Data Synthesis Approach, Smartphone LLM Inference

NaRCan: Natural Refined Canonical Image with Integration of Diffusion Prior for Video Editing; MotionClone: Training-Free Motion Cloning for Controllable Video Generation; What If We Recaption Billions of Web Images with LLaMA-3?; Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing; PowerInfer-2: Fast Large Language Model Inference on a Smartphone...

Jun 16, 2024•12 min•Ep. 51

Revolutionizing Image Synthesis with TiTok, Multilingual Code Benchmark, Exploring GenAI Prompting Techniques

An Image is Worth 32 Tokens for Reconstruction and Generation; McEval: Massively Multilingual Code Evaluation; Zero-shot Image Editing with Reference Imitation; The Prompt Report: A Systematic Survey of Prompting Techniques; TextGrad: Automatic "Differentiation" via Text...

Jun 15, 2024•11 min•Ep. 50

LlamaGen's Image Revolution, Husky: The Multi-Step Reasoner, Vript's Video Breakthrough, VALL-E 2 Achieves Human Parity

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation; Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning; Vript: A Video Is Worth Thousands of Words; Lighting Every Darkness with 3DGS: Fast Training and Real-Time Rendering for HDR View Synthesis; VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers...

Jun 14, 2024•11 min•Ep. 49

Mixture-of-Agents, Benchmarking LLMs, and GenAI Arena Evaluation

Mixture-of-Agents Enhances Large Language Model Capabilities; WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild; CRAG -- Comprehensive RAG Benchmark; GenAI Arena: An Open Evaluation Platform for Generative Models; Large Language Model Confidence Estimation via Black-Box Access...

Jun 12, 2024•11 min•Ep. 48

Enhancing AI Video and Image Generation, BitsFusion Quantization, Step-aware Optimization, Thought-Augmented Reasoning, and Single Forward Video Generation

ShareGPT4Video: Improving Video Understanding and Generation with Better Captions; BitsFusion: 1.99 bits Weight Quantization of Diffusion Model; Step-aware Preference Optimization: Aligning Preference with Denoising Performance at Each Step; Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models; SF-V: Single Forward Video Generation Model...

Jun 11, 2024•12 min•Ep. 47

AI Papers Podcast Special Edition: Apple Intelligence & Ferret-UI

Apple announced new Siri features and Apple Intelligence today. Interestingly, Apple had already released a paper, titled "Ferret-UI," on how it all works: a multimodal vision-language model capable of understanding widgets, icons, and text on an iOS mobile screen, and of reasoning about their spatial relationships and functional meanings. https://arxiv.org/abs/2404.05719

Jun 10, 2024•2 min•Ep. 46

Block Transformers: Faster Inference, Mobile Device AI Agents, 3D-Image Generation, Low Latency TTS

Block Transformer: Global-to-Local Language Modeling for Fast Inference; Parrot: Multilingual Visual Instruction Tuning; Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration; Ouroboros3D: Image-to-3D Generation via 3D-aware Recursive Diffusion; LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autoregressive Modeling of Audio Discrete Codes...

Jun 10, 2024•11 min•Ep. 45

Seed-TTS, Decoding LLMs, Innovations in Text-to-Video, Self-Improving AI Preferences, and Refining Diffusion Models

Seed-TTS: A Family of High-Quality Versatile Speech Generation Models; To Believe or Not to Believe Your LLM; I4VGen: Image as Stepping Stone for Text-to-Video Generation; Self-Improving Robust Preference Optimization; Guiding a Diffusion Model with a Bad Version of Itself...

Jun 07, 2024•11 min•Ep. 44

MMLU-Pro: Next-Level Language Understanding, Tailored LLMs, High FPS Video Generation Innovation

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark; Learning Temporally Consistent Video Depth from Video Diffusion Priors; Show, Don't Tell: Aligning Language Models with Demonstrated Feedback; Artificial Generational Intelligence: Cultural Accumulation in Reinforcement Learning; ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation...

Jun 06, 2024•12 min•Ep. 43

Transformers and State-Space Models Unite, Multi-modal LLM Benchmark, Perplexity in Data Pruning, Advancing 4D Content Generation

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality; Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis; Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models; Kaleido Diffusion: Improving Conditional Diffusion Models with Autoregressive Latent Modeling; 4Diffusion: Multi-view Video Diffusion Model for 4D Generation...

Jun 05, 2024•10 min•Ep. 42

DITTO-2 Speeds Up Music AI, GECO's Quick 3D Generation, PLA4D's 4D Advances, DevEval's Real-World Code Benchmark, Parrot's LLM Application Efficiency

AI Papers Podcast for 06/04/2024: DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music Generation; GECO: Generative Image-to-3D within a SECOnd; PLA4D: Pixel-Level Alignments for Text-to-4D Gaussian Splatting; DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories; Parrot: Efficient Serving of LLM-based Applications with Semantic Variable...

Jun 04, 2024•11 min•Ep. 41

Boosting Text Retrieval with CLIP Models, Rethinking Retrieval Augmented Generation, and Deciphering Human Behavior through MotionLLM

AI Papers Podcast for 06/03/2024: Jina CLIP: Your CLIP Model Is Also Your Text Retriever; Similarity is Not All You Need: Endowing Retrieval Augmented Generation with Multi Layered Thoughts; MotionLLM: Understanding Human Behaviors from Human Motions and Videos; Xwin-LM: Strong and Scalable Alignment Practice for LLMs; MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model...

Jun 03, 2024•11 min•Ep. 40

Bilingual LLM Transparency, T2V-Turbo's Video Generation, LLMs Surpassing Human Theory of Mind Performance, Advancements in LLM Attribution

MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series; T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback; LLMs achieve adult human performance on higher-order theory of mind tasks; Nearest Neighbor Speculative Decoding for LLM Generation and Attribution; Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities...

May 31, 2024•9 min•Ep. 39

Phased Consistency Model, 2-Stage Backpropagation, and the Future of 4D World Reconstruction

Phased Consistency Model; 2BP: 2-Stage Backpropagation; GFlow: Recovering 4D World from Monocular Video; Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via Instruction Tuning; LLaMA-NAS: Efficient Neural Architecture Search for Large Language Models...

May 30, 2024•8 min•Ep. 38

Vision-Language Models, Arithmetic Transformers, Next-Gen Video Editing

An Introduction to Vision-Language Modeling; Transformers Can Do Arithmetic with the Right Embeddings; Matryoshka Multimodal Models; I2VEdit: First-Frame-Guided Video Editing via Image-to-Video Diffusion Models; Zamba: A Compact 7B SSM Hybrid Model; Looking Backward: Streaming Video-to-Video Translation with Feature Banks...

May 29, 2024•10 min•Ep. 37

ConvLLaVA's Visual Compression, Efficient LLVM, Multilingual Aya 23, and AutoCoder's Code Mastery

ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models; Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models; Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization; Aya 23: Open Weight Releases to Further Multilingual Progress; Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training; AutoCoder: Enhancing Code Large Language Model with AIEV-Instruct...

May 28, 2024•11 min•Ep. 36

Revolution in Image Generation, Thermodynamic Gradient Descent, DMD2 for Fast Synthesis, Distributed Speculative Inference

May 26, 2024•11 min•Ep. 35

Language Model Mysteries, Personalized Image Generation, Audio-Visual Transformer Innovations, DeepSeek-Prover, Dense Connector: MLLM Potential

ReVideo: Remake a Video with Motion and Content Control; Not All Language Model Features Are Linear; RectifID: Personalizing Rectified Flow with Anchored Classifier Guidance; Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation; DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data; Dense Connector for MLLMs...

May 24, 2024•11 min•Ep. 34

Transformer Linearity, Face-Adapter Diffusion Models, Cross-Layer Attention Shrinks LLMs, Image Generation Breakthrough

Your Transformer is Secretly Linear; Diffusion for World Modeling: Visual Details Matter in Atari; Face Adapter for Pre-Trained Diffusion Models with Fine-Grained ID and Attribute Control; Reducing Transformer Key-Value Cache Size with Cross-Layer Attention; OmniGlue: Generalizable Feature Matching with Foundation Model Guidance; Personalized Residuals for Concept-Driven Text-to-Image Generation...

May 23, 2024•10 min•Ep. 33

Infinite Video Generation, High-Rank Fine-Tuning, Modular LLMs with LoRA Libraries

FIFO-Diffusion: Generating Infinite Videos from Text without Training; MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning; OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework; Imp: Highly Capable Large Multimodal Models for Mobile Devices; Octo: An Open-Source Generalist Robot Policy; Towards Modular LLMs by Building and Reusing a Library of LoRAs...

May 22, 2024•9 min•Ep. 32

Tailoring Language Models for Science, Scaling Laws in NLP, Grounded 3D-LLM Innovations, Efficient Large Model Inference

INDUS: Effective and Efficient Language Models for Scientific Applications; Observational Scaling Laws and the Predictability of Language Model Performance; Grounded 3D-LLM with Referent Tokens; Layer-Condensed KV Cache for Efficient Inference of Large Language Models; Dynamic data sampler for cross-language transfer learning in large language models...

May 21, 2024•10 min•Ep. 31

Chameleon's Multimodal Breakthrough, LoRA's Learning Efficiency, Many-Shot In-Context Learning, Object Detection Innovation, Text-to-3D Generation

Chameleon: Mixed-Modal Early-Fusion Foundation Models; LoRA Learns Less and Forgets Less; Many-Shot In-Context Learning in Multimodal Foundation Models; CAT3D: Create Anything in 3D with Multi-View Diffusion Models; Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection; Dual3D: Efficient and Consistent Text-to-3D Generation with Dual-mode...

May 18, 2024•11 min•Ep. 30

Efficient Multimodality, Vision Suite's Custom Data, EEG Music Decoding Advances, Mobile Video Breakthrough

ALPINE: Unveiling the Planning Capability of Autoregressive Learning in Language Models; Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model; BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation; Naturalistic Music Decoding from EEG Data via Latent Diffusion Models; No Time to Waste: Squeeze Time into Channel for Mobile Video Understanding...

May 17, 2024•9 min•Ep. 29

Transformer Models Beyond Scaling, Multilingual Image Synthesis, Advanced Text-to-Image Control

VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models; Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory; Coin3D: Controllable and Interactive 3D Assets Generation with Proxy-Guided Conditioning; Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding; Compositional Text-to-Image Generation with Dense Blob Representations...

May 16, 2024•9 min•Ep. 28

Vision-Language Model Design, Online RLHF Workflow, Multilingual AI, AI Memory Solution

What matters when building vision-language models?; RLHF Workflow: From Reward Modeling to Online RLHF; SUTRA: Scalable Multilingual Language Model Architecture; SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts; Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots...

May 15, 2024•10 min•Ep. 27

BlenderAlchemy Revolution, Stylus Adapter Magic, DressCode Digital Fashion

BlenderAlchemy: Editing 3D Graphics with Vision-Language Models; Stylus: Automatic Adapter Selection for Diffusion Models; Ag2Manip: Learning Novel Manipulation Skills with Agent-Agnostic Visual and Action Representations; DressCode: Autoregressively Sewing and Generating Garments from Text Guidance; PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning...

May 14, 2024•10 min•Ep. 26
