AI Papers Podcast

A daily update on the latest AI research papers. We provide a high-level overview of a handful of papers each day and link all papers in the description for further reading. This podcast is created entirely with AI by PocketPod. Head over to https://pocketpod.app to learn more.

Last refreshed: December 3rd, 2025 at 6:03 AM

Episodes

Real-Time Motion Control, Next-Gen Visual Captions, 3D Scene Reconstruction Innovations

MotionLCM: Real-time Controllable Motion Generation via Latent Consistency Model; Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation; GS-LRM: Large Reconstruction Model for 3D Gaussian Splatting; SAGS: Structure-Aware 3D Gaussian Splatting; Invisible Stitch: Generating Smooth 3D Scenes with Depth Inpainting...

May 09, 2024•12 min•Ep. 25

Kolmogorov-Arnold Networks, Iterative Reasoning Optimization, Extending Llama-3 Context Length

KAN: Kolmogorov-Arnold Networks; InstantFamily: Masked Attention for Zero-shot Multi-ID Image Generation; Better & Faster Large Language Models via Multi-token Prediction; Iterative Reasoning Preference Optimization; Extending Llama-3's Context Ten-Fold Overnight...

May 08, 2024•11 min•Ep. 24

Innovative Image Editing, Advanced Autonomous Tracking, and the Evolution of Open-Source AI

Paint by Inpaint: Learning to Add Image Objects by Removing Them First; Self-Play Preference Optimization for Language Model Alignment; Automatic Creative Selection with Cross-Modal Matching; STT: Stateful Tracking with Transformers for Autonomous Driving; Octopus v4: Graph of language models...

May 07, 2024•12 min•Ep. 23

GPT-4 Rival Models, Revolutionizing Open Source LM Evaluation, StoryDiffusion's Visual Narrative Breakthrough

Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models; WildChat: 1M ChatGPT Interaction Logs in the Wild; StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation; LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report; LLM-AD: Large Language Model based Audio Description System...

May 03, 2024•12 min•Ep. 22

Model Editing Insights with Llama-3, Rethinking Large Language Models in Math, 3D Rendering and Audio Compression

Is Bigger Edit Batch Size Always Better? -- An Empirical Study on Model Editing with Llama-3; A Careful Examination of Large Language Model Performance on Grade School Arithmetic; Spectrally Pruned Gaussian Fields with Neural Compensation; SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound; Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge...

May 02, 2024•12 min•Ep. 21

Advancing LLMs with Multi-Token Prediction, Octopus v4 Revolution in Open-Source Language Models, Enhancing Reasoning with Iterative Preference Optimization

Octopus v4: Graph of language models; InstantFamily: Masked Attention for Zero-shot Multi-ID Image Generation; Better & Faster Large Language Models via Multi-token Prediction; GS-LRM: Large Reconstruction Model for 3D Gaussian Splatting; Iterative Reasoning Preference Optimization...

May 01, 2024•12 min•Ep. 20

Evaluating LLMs with Diverse Models, Novel Robotic Skills Framework, Editing 3D Graphics with VLMs

Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models; LEGENT: Open Platform for Embodied Agents; Ag2Manip: Learning Novel Manipulation Skills with Agent-Agnostic Visual and Action Representations; Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting; BlenderAlchemy: Editing 3D Graphics with Vision-Language Models...

Apr 30, 2024•11 min•Ep. 19

PLLaVA Breakthrough in Video-Language Modeling, Exploring Landmarks with HaLo-NeRF, and MaPa's Text-driven 3D Material Painting

AI Papers Podcast for 04/29/2024. PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning; AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs; HaLo-NeRF: Learning Geometry-Guided Semantics for Exploring Unconstrained Photo Collections; MaPa: Text-driven Photorealistic Material Painting for 3D Shapes...

Apr 29, 2024•9 min•Ep. 18

Bridging the Gap to GPT-4V, Interactive 3D Generation, Accelerating LLM Inference

AI Papers Podcast for 04/26/2024. How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites; Interactive3D: Create What You Want by Interactive 3D Generation; Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding; Tele-FLM Technical Report; SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension...

Apr 26, 2024•12 min•Ep. 17

Hyper-SD Breakthrough, MAIA's Neural Understanding, SEED-X Multimodal Innovation

AI Papers Podcast for 04/25/2024. Hyper-SD: Trajectory Segmented Consistency Model for Efficient Image Synthesis; A Multimodal Automated Interpretability Agent; SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation; MultiBooth: Towards Generating All Your Concepts in an Image from Text; Learning H-Infinity Locomotion Control...

Apr 25, 2024•11 min•Ep. 16

Enhancing AI with Multi-Head MoEs, Pegasus-1's Video Mastery, Optimizing Diffusion Models

AI Papers Podcast for 04/24/2024. OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework; Multi-Head Mixture-of-Experts; Pegasus-v1 Technical Report; Align Your Steps: Optimizing Sampling Schedules in Diffusion Models; SnapKV: LLM Knows What You are Looking for Before Generation...

Apr 24, 2024•11 min•Ep. 15

Model Efficiency, Instruction Prioritization, and Workflow Automation

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone; The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions; FlowMind: Automatic Workflow Generation with LLMs; Music Consistency Models; How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study...

Apr 23, 2024•12 min•Ep. 14

Physics-Based Video, Text-Centric Visuals, Gaussian Splatting, Program Repair, Progressive Web Crawling

AI Papers Podcast for 04/23/2024. PhysDreamer: Physics-Based Interaction with 3D Objects via Video Generation; TextSquare: Scaling up Text-Centric Visual Instruction Tuning; Does Gaussian Splatting need SFM Initialization?; How Far Can We Go with Practical Function-Level Program Repair?; AutoCrawler: A Progressive Understanding Web Agent for Web Crawler Generation...

Apr 22, 2024•12 min•Ep. 13

Adapting Diverse Controls: Ctrl-Adapter, HQ-Edit, Tango 2

AI Papers Podcast for 04/21/2024. Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model; HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing; Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization; TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models; On Speculative Decoding for Multimodal Large Language Models...

Apr 21, 2024•12 min•Ep. 12

Dynamic Typography, Mesh Reconstruction, and Personalized Image Generation

AI Papers Podcast for 04/20/2024. Dynamic Typography: Bringing Words to Life; Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing; MeshLRM: Large Reconstruction Model for High-Quality Mesh; MoA: Mixture-of-Attention for Subject-Context Disentanglement in Personalized Image Generation; EdgeFusion: On-Device Text-to-Image Generation...

Apr 20, 2024•11 min•Ep. 11

AI Papers for 04/19/2024: Multimodal Advancements, AI Animation, Speculative Decoding

AI Papers Podcast for 04/19/2024. BLINK: Multimodal Large Language Models Can See but Not Perceive; Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models; AniClipart: Clipart Animation with Text-to-Video Priors; TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding; OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data...

Apr 19, 2024•12 min•Ep. 10

AI Papers Special Release: Llama 3

AI Papers Podcast for 04/19/2024. Meta releases Llama 3, claims it's among the best open models available

Apr 18, 2024•2 min•Ep. 9

AI Papers for 04/18/2024: "Generating Full-Length Music with Latent Diffusion"

AI Papers Podcast for 04/18/2024. Long-form music generation with latent diffusion; Scaling Instructable Agents Across Many Simulated Worlds

Apr 18, 2024•6 min•Ep. 8

AI Papers for 04/17/2024: Efficient Methods for Model Alignment and Compression

AI Papers Podcast for 04/17/2024. Learn Your Reference Model for Real Good Alignment; Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length; TransformerFAM: Feedback attention is working memory; Compression Represents Intelligence Linearly; Video2Game: Real-time, Interactive, Realistic and Browser-Compatible Environment from a Single Video...

Apr 17, 2024•12 min•Ep. 7

AI Papers for 04/16/2024: Advancing Language Models for Multimodal and Long-context Learning

AI Papers Podcast for 04/16/2024. Octopus v2: On-device language model for super agent; Advancing LLM Reasoning Generalists with Preference Trees; Long-context LLMs Struggle with Long In-context Learning; LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model; Bigger is not Always Better: Scaling Properties of Latent Diffusion Models...

Apr 16, 2024•12 min•Ep. 6

AI Papers for 04/15/2024: Modernizing Segmentation, Analyzing CLIP, and Probing 3D Awareness in Vision Models

AI Papers Podcast for 04/15/2024. COCONut: Modernizing COCO Segmentation; Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies; On the Robustness of Language Guidance for Low-Level Vision Tasks: Findings from Depth Estimation; Pre-training Small Base LMs with Fewer Tokens; Probing the 3D Awareness of Visual Foundation Models...

Apr 15, 2024•11 min•Ep. 5

AI Papers Podcast for 04/14/2024

AI Papers Podcast for 04/14/2024 Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences: https://arxiv.org/abs/2404.03715 No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance: https://arxiv.org/abs/2404.04125 AutoWebGLM: Bootstrap And Reinforce A Large Language Model-based Web Navigating Agent: https://arxiv.org/abs/2404.03648 Stream of Search (SoS): Learning to Search in Language: https://arxiv.org/ab...

Apr 14, 2024•11 min•Ep. 4

AI Papers Podcast for 04/13/2024

AI Papers Podcast for 04/13/2024 OmniFusion Technical Report: https://arxiv.org/abs/2404.06212 LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders: https://arxiv.org/abs/2404.05961 InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD: https://arxiv.org/abs/2404.06512 Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence: https://arxiv.org/abs/2404.05892 MiniCPM: Unveiling the Potential of Small Language M...

Apr 13, 2024•13 min•Ep. 3

AI Papers Podcast for 04/12/2024

AI Papers Podcast for 04/12/2024 RecurrentGemma: Moving Past Transformers for Efficient Open Language Models: https://arxiv.org/abs/2404.07839 WILBUR: Adaptive In-Context Learning for Robust and Accurate Web Agents: https://arxiv.org/abs/2404.05902 Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models: https://arxiv.org/abs/2404.07973 Best Practices and Lessons Learned on Synthetic Data for Language Models: https://arxiv.org/abs/2404.07503 HGRN2: Gated Linear RNNs wi...

Apr 12, 2024•11 min•Ep. 2

AI Papers Podcast for 04/11/2024

AI Papers Podcast for 04/12/2024 ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback: https://arxiv.org/abs/2404.07987 OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments: https://arxiv.org/abs/2404.07972 Rho-1: Not All Tokens Are What You Need: https://arxiv.org/abs/2404.07965 JetMoE: Reaching Llama2 Performance with 0.1M Dollars: https://arxiv.org/abs/2404.07413 Transferable and Principled Efficiency for Open-Vocabulary Segmen...

Apr 12, 2024•11 min•Ep. 1

Hosted on Transistor