Daily Paper Cast

Jingwen Liang, Gengyu Wang•dailypapercast.transistor.fm

We publish 10 episodes every day, covering 10 AI research papers. Both the podcast scripts and the audio are generated by AI. The 10 papers are selected from the highest-voted ones on Hugging Face Daily Papers (https://huggingface.co/papers). Feedback and suggestions are welcome! Email us: dailypapercast.ai@gmail.com

Creators:
Jingwen Liang, 3D ML, https://www.linkedin.com/in/jingwen-liang/
Gengyu Wang, NLP, http://wanggengyu.com

Listen on:
Spotify: https://open.spotify.com/show/21nrhmdaA8qoBiH8q03NXL
Apple Podcasts: https://podcasts.apple.com/us/podcast/daily-paper-cast/id1777620236

Cover image by Kawen Kuang, https://kawen.art

Last refreshed: July 27th, 2025 at 9:35 AM

Episodes

VideoRAG: Retrieval-Augmented Generation over Video Corpus

🤗 Upvotes: 43 | cs.CV, cs.AI, cs.CL, cs.IR, cs.LG Authors: Soyeong Jeong, Kangsan Kim, Jinheon Baek, Sung Ju Hwang Title: VideoRAG: Retrieval-Augmented Generation over Video Corpus Arxiv: http://arxiv.org/abs/2501.05874v1 Abstract: Retrieval-Augmented Generation (RAG) is a powerful strategy to address the issue of generating factually incorrect outputs in foundation models by retrieving external knowledge relevant to queries and incorporating it into their generation process. However, existing ...

Jan 14, 2025•22 min•Ep. 381

OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?

🤗 Upvotes: 29 | cs.CV, cs.AI Authors: Yifei Li, Junbo Niu, Ziyang Miao, Chunjiang Ge, Yuanhang Zhou, Qihao He, Xiaoyi Dong, Haodong Duan, Shuangrui Ding, Rui Qian, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang Title: OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding? Arxiv: http://arxiv.org/abs/2501.05510v1 Abstract: Temporal Awareness, the ability to reason dynamically based on the timestamp when a question is raised, is the key distinction between o...

Jan 14, 2025•22 min•Ep. 380

Enabling Scalable Oversight via Self-Evolving Critic

🤗 Upvotes: 22 | cs.CL, cs.AI, cs.LG Authors: Zhengyang Tang, Ziniu Li, Zhenyang Xiao, Tian Ding, Ruoyu Sun, Benyou Wang, Dayiheng Liu, Fei Huang, Tianyu Liu, Bowen Yu, Junyang Lin Title: Enabling Scalable Oversight via Self-Evolving Critic Arxiv: http://arxiv.org/abs/2501.05727v1 Abstract: Despite their remarkable performance, the development of Large Language Models (LLMs) faces a critical challenge in scalable oversight: providing effective feedback for tasks where human evaluation is difficu...

Jan 14, 2025•28 min•Ep. 379

Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models

🤗 Upvotes: 14 | cs.CL, cs.AI, cs.CV Authors: You Li, Heyu Huang, Chi Chen, Kaiyu Huang, Chao Huang, Zonghao Guo, Zhiyuan Liu, Jinan Xu, Yuhua Li, Ruixuan Li, Maosong Sun Title: Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models Arxiv: http://arxiv.org/abs/2501.05767v2 Abstract: The recent advancement of Multimodal Large Language Models (MLLMs) has significantly improved their fine-grained perception of single images and general comprehension acr...

Jan 14, 2025•22 min•Ep. 378

ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding

🤗 Upvotes: 10 | cs.CV, cs.CL Authors: Xingyu Fu, Minqian Liu, Zhengyuan Yang, John Corring, Yijuan Lu, Jianwei Yang, Dan Roth, Dinei Florencio, Cha Zhang Title: ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding Arxiv: http://arxiv.org/abs/2501.05452v1 Abstract: Structured image understanding, such as interpreting tables and charts, requires strategically refocusing across various structures and texts within an image, forming a reasoning sequence to arrive at the f...

Jan 14, 2025•23 min•Ep. 377

ConceptMaster: Multi-Concept Video Customization on Diffusion Transformer Models Without Test-Time Tuning

🤗 Upvotes: 10 | cs.CV Authors: Yuzhou Huang, Ziyang Yuan, Quande Liu, Qiulin Wang, Xintao Wang, Ruimao Zhang, Pengfei Wan, Di Zhang, Kun Gai Title: ConceptMaster: Multi-Concept Video Customization on Diffusion Transformer Models Without Test-Time Tuning Arxiv: http://arxiv.org/abs/2501.04698v1 Abstract: Text-to-video generation has made remarkable advancements through diffusion models. However, Multi-Concept Video Customization (MCVC) remains a significant challenge. We identify two key challen...

Jan 14, 2025•24 min•Ep. 376

Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains

🤗 Upvotes: 8 | cs.CL, cs.AI, cs.LG Authors: Vighnesh Subramaniam, Yilun Du, Joshua B. Tenenbaum, Antonio Torralba, Shuang Li, Igor Mordatch Title: Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains Arxiv: http://arxiv.org/abs/2501.05707v1 Abstract: Large language models (LLMs) have achieved remarkable performance in recent years but are fundamentally limited by the underlying training data. To improve models beyond the training data, recent works have explored how LLMs can be...

Jan 14, 2025•22 min•Ep. 375

The GAN is dead; long live the GAN! A Modern GAN Baseline

🤗 Upvotes: 27 | cs.LG, cs.CV Authors: Yiwen Huang, Aaron Gokaslan, Volodymyr Kuleshov, James Tompkin Title: The GAN is dead; long live the GAN! A Modern GAN Baseline Arxiv: http://arxiv.org/abs/2501.05441v1 Abstract: There is a widely-spread claim that GANs are difficult to train, and GAN architectures in the literature are littered with empirical tricks. We provide evidence against this claim and build a modern GAN baseline in a more principled manner. First, we derive a well-behaved regulariz...

Jan 11, 2025•20 min•Ep. 374

An Empirical Study of Autoregressive Pre-training from Videos

🤗 Upvotes: 17 | cs.CV, cs.AI Authors: Jathushan Rajasegaran, Ilija Radosavovic, Rahul Ravishankar, Yossi Gandelsman, Christoph Feichtenhofer, Jitendra Malik Title: An Empirical Study of Autoregressive Pre-training from Videos Arxiv: http://arxiv.org/abs/2501.05453v1 Abstract: We empirically study autoregressive pre-training from videos. To perform our study, we construct a series of autoregressive video models, called Toto. We treat videos as sequences of visual tokens and train transformer mod...

Jan 11, 2025•22 min•Ep. 373

Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives

🤗 Upvotes: 10 | cs.CV, cs.RO Authors: Shaoyuan Xie, Lingdong Kong, Yuhao Dong, Chonghao Sima, Wenwei Zhang, Qi Alfred Chen, Ziwei Liu, Liang Pan Title: Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives Arxiv: http://arxiv.org/abs/2501.04003v1 Abstract: Recent advancements in Vision-Language Models (VLMs) have sparked interest in their use for autonomous driving, particularly in generating interpretable driving decisions through natural...

Jan 11, 2025•22 min•Ep. 372

Entropy-Guided Attention for Private LLMs

🤗 Upvotes: 6 | cs.LG, cs.CR Authors: Nandan Kumar Jha, Brandon Reagen Title: Entropy-Guided Attention for Private LLMs Arxiv: http://arxiv.org/abs/2501.03489v2 Abstract: The pervasiveness of proprietary language models has raised critical privacy concerns, necessitating advancements in private inference (PI), where computations are performed directly on encrypted data without revealing users' sensitive information. While PI offers a promising solution, its practical deployment is hindered by su...

Jan 11, 2025•24 min•Ep. 371

On Computational Limits and Provably Efficient Criteria of Visual Autoregressive Models: A Fine-Grained Complexity Analysis

🤗 Upvotes: 5 | cs.LG, cs.AI, cs.CC, cs.CV Authors: Yekun Ke, Xiaoyu Li, Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song Title: On Computational Limits and Provably Efficient Criteria of Visual Autoregressive Models: A Fine-Grained Complexity Analysis Arxiv: http://arxiv.org/abs/2501.04377v1 Abstract: Recently, Visual Autoregressive ($\mathsf{VAR}$) Models introduced a groundbreaking advancement in the field of image generation, offering a scalable approach through a coarse-to-fine "next-scale...

Jan 11, 2025•19 min•Ep. 370

Centurio: On Drivers of Multilingual Ability of Large Vision-Language Model

🤗 Upvotes: 5 | cs.CL, cs.CV Authors: Gregor Geigle, Florian Schneider, Carolin Holtermann, Chris Biemann, Radu Timofte, Anne Lauscher, Goran Glavaš Title: Centurio: On Drivers of Multilingual Ability of Large Vision-Language Model Arxiv: http://arxiv.org/abs/2501.05122v1 Abstract: Most Large Vision-Language Models (LVLMs) to date are trained predominantly on English data, which makes them struggle to understand non-English input and fail to generate output in the desired target language. Existi...

Jan 11, 2025•22 min•Ep. 369

SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution

🤗 Upvotes: 4 | cs.CL Authors: Chengxing Xie, Bowen Li, Chang Gao, He Du, Wai Lam, Difan Zou, Kai Chen Title: SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution Arxiv: http://arxiv.org/abs/2501.05040v1 Abstract: Large Language Models (LLMs) have demonstrated remarkable proficiency across a variety of complex tasks. One significant application of LLMs is in tackling software engineering challenges, particularly in resolving real-world tasks on GitHub by fixin...

Jan 11, 2025•22 min•Ep. 368

Building Foundations for Natural Language Processing of Historical Turkish: Resources and Models

🤗 Upvotes: 3 | cs.CL Authors: Şaziye Betül Özateş, Tarık Emre Tıraş, Ece Elif Adak, Berat Doğan, Fatih Burak Karagöz, Efe Eren Genç, Esma F. Bilgin Taşdemir Title: Building Foundations for Natural Language Processing of Historical Turkish: Resources and Models Arxiv: http://arxiv.org/abs/2501.04828v1 Abstract: This paper introduces foundational resources and models for natural language processing (NLP) of historical Turkish, a domain that has remained underexplored in computational linguistics....

Jan 11, 2025•27 min•Ep. 367

rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking

🤗 Upvotes: 116 | cs.CL Authors: Xinyu Guan, Li Lyna Zhang, Yifei Liu, Ning Shang, Youran Sun, Yi Zhu, Fan Yang, Mao Yang Title: rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking Arxiv: http://arxiv.org/abs/2501.04519v1 Abstract: We present rStar-Math to demonstrate that small language models (SLMs) can rival or even surpass the math reasoning capability of OpenAI o1, without distillation from superior models. rStar-Math achieves this by exercising "deep thinking" ...

Jan 10, 2025•27 min•Ep. 366

Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought

🤗 Upvotes: 47 | cs.AI, cs.CL Authors: Violet Xiang, Charlie Snell, Kanishk Gandhi, Alon Albalak, Anikait Singh, Chase Blagden, Duy Phung, Rafael Rafailov, Nathan Lile, Dakota Mahan, Louis Castricato, Jan-Philipp Franken, Nick Haber, Chelsea Finn Title: Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought Arxiv: http://arxiv.org/abs/2501.04682v1 Abstract: We propose a novel framework, Meta Chain-of-Thought (Meta-CoT), which extends traditional Chain-of-Thought (Co...

Jan 10, 2025•25 min•Ep. 365

URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics

🤗 Upvotes: 38 | cs.CL, cs.AI, cs.LG Authors: Ruilin Luo, Zhuofan Zheng, Yifan Wang, Yiyao Yu, Xinzhe Ni, Zicheng Lin, Jin Zeng, Yujiu Yang Title: URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics Arxiv: http://arxiv.org/abs/2501.04686v1 Abstract: Chain-of-thought (CoT) reasoning has been widely applied in the mathematical reasoning of Large Language Models (LLMs). Recently, the introduction of derivative process supervision on CoT trajectories has sparked di...

Jan 10, 2025•24 min•Ep. 364

Agent Laboratory: Using LLM Agents as Research Assistants

🤗 Upvotes: 38 | cs.HC, cs.AI, cs.CL, cs.LG Authors: Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Zicheng Liu, Emad Barsoum Title: Agent Laboratory: Using LLM Agents as Research Assistants Arxiv: http://arxiv.org/abs/2501.04227v1 Abstract: Historically, scientific discovery has been a lengthy and costly process, demanding substantial time and resources from initial conception to final results. To accelerate scientific discovery, reduce research costs, a...

Jan 10, 2025•24 min•Ep. 363

LLM4SR: A Survey on Large Language Models for Scientific Research

🤗 Upvotes: 21 | cs.CL, cs.DL Authors: Ziming Luo, Zonglin Yang, Zexin Xu, Wei Yang, Xinya Du Title: LLM4SR: A Survey on Large Language Models for Scientific Research Arxiv: http://arxiv.org/abs/2501.04306v1 Abstract: In recent years, the rapid advancement of Large Language Models (LLMs) has transformed the landscape of scientific research, offering unprecedented support across various stages of the research cycle. This paper presents the first systematic survey dedicated to exploring how LLMs a...

Jan 10, 2025•25 min•Ep. 362

InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection

🤗 Upvotes: 16 | cs.AI, cs.CL, cs.HC Authors: Yuhang Liu, Pengxiang Li, Zishu Wei, Congkai Xie, Xueyu Hu, Xinchen Xu, Shengyu Zhang, Xiaotian Han, Hongxia Yang, Fei Wu Title: InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection Arxiv: http://arxiv.org/abs/2501.04575v1 Abstract: Graphical User Interface (GUI) Agents, powered by multimodal large language models (MLLMs), have shown great potential for task automation on computing devices such as computers and mobile ...

Jan 10, 2025•21 min•Ep. 361

SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images

🤗 Upvotes: 12 | cs.CV, cs.GR Authors: Zixuan Huang, Mark Boss, Aaryaman Vasishta, James M. Rehg, Varun Jampani Title: SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images Arxiv: http://arxiv.org/abs/2501.04689v1 Abstract: We study the problem of single-image 3D object reconstruction. Recent works have diverged into two directions: regression-based modeling and generative modeling. Regression methods efficiently infer visible surfaces, but struggle with occluded regions. Ge...

Jan 10, 2025•23 min•Ep. 360

GeAR: Generation Augmented Retrieval

🤗 Upvotes: 12 | cs.IR, cs.CL Authors: Haoyu Liu, Shaohan Huang, Jianfeng Liu, Yuefeng Zhan, Hao Sun, Weiwei Deng, Feng Sun, Furu Wei, Qi Zhang Title: GeAR: Generation Augmented Retrieval Arxiv: http://arxiv.org/abs/2501.02772v1 Abstract: Document retrieval techniques form the foundation for the development of large-scale information systems. The prevailing methodology is to construct a bi-encoder and compute the semantic similarity. However, such scalar similarity is difficult to reflect enough...

Jan 10, 2025•22 min•Ep. 359

Chirpy3D: Continuous Part Latents for Creative 3D Bird Generation

🤗 Upvotes: 10 | cs.CV, cs.GR Authors: Kam Woh Ng, Jing Yang, Jia Wei Sii, Jiankang Deng, Chee Seng Chan, Yi-Zhe Song, Tao Xiang, Xiatian Zhu Title: Chirpy3D: Continuous Part Latents for Creative 3D Bird Generation Arxiv: http://arxiv.org/abs/2501.04144v1 Abstract: In this paper, we push the boundaries of fine-grained 3D generation into truly creative territory. Current methods either lack intricate details or simply mimic existing objects -- we enable both. By lifting 2D fine-grained understand...

Jan 10, 2025•24 min•Ep. 358

DPO Kernels: A Semantically-Aware, Kernel-Enhanced, and Divergence-Rich Paradigm for Direct Preference Optimization

🤗 Upvotes: 5 | cs.LG, cs.AI, cs.CL, 68T45 Authors: Amitava Das, Suranjana Trivedy, Danush Khanna, Rajarshi Roy, Gurpreet Singh, Basab Ghosh, Yaswanth Narsupalli, Vinija Jain, Vasu Sharma, Aishwarya Naresh Reganti, Aman Chadha Title: DPO Kernels: A Semantically-Aware, Kernel-Enhanced, and Divergence-Rich Paradigm for Direct Preference Optimization Arxiv: http://arxiv.org/abs/2501.03271v2 Abstract: The rapid rise of large language models (LLMs) has unlocked many applications but also underscores ...

Jan 10, 2025•23 min•Ep. 357

REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models

🤗 Upvotes: 51 | cs.CL, cs.LG Authors: Jian Hu Title: REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models Arxiv: http://arxiv.org/abs/2501.03262v1 Abstract: Reinforcement Learning from Human Feedback (RLHF) has emerged as a critical approach for aligning large language models with human preferences, witnessing rapid algorithmic evolution through methods such as Proximal Policy Optimization (PPO), Direct Preference Optimization (DPO), REINFORCE Leave One-Out (RLOO), Re...

Jan 09, 2025•22 min•Ep. 356

MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models

🤗 Upvotes: 32 | cs.CV Authors: Wenyi Hong, Yean Cheng, Zhuoyi Yang, Weihan Wang, Lefan Wang, Xiaotao Gu, Shiyu Huang, Yuxiao Dong, Jie Tang Title: MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models Arxiv: http://arxiv.org/abs/2501.02955v1 Abstract: In recent years, vision language models (VLMs) have made significant advancements in video understanding. However, a crucial capability - fine-grained motion comprehension - remains under-explor...

Jan 09, 2025•23 min•Ep. 355

Cosmos World Foundation Model Platform for Physical AI

🤗 Upvotes: 31 | cs.CV, cs.AI, cs.LG, cs.RO Authors: NVIDIA: Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, Daniel Dworakowski, Jiaojiao Fan, Michele Fenzi, Francesco Ferroni, Sanja Fidler, Dieter Fox, Songwei Ge, Yunhao Ge, Jinwei Gu, Siddharth Gururani, Ethan He, Jiahui Huang, Jacob Huffman, Pooya Jannaty, Jingyi Jin, Seung Wook Kim, Gergely Klár, Grace Lam, Shiyi Lan, Laura Leal-Taixe, Anqi Li, Z...

Jan 09, 2025•26 min•Ep. 354

LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token

🤗 Upvotes: 22 | cs.CV, cs.AI, cs.CL Authors: Shaolei Zhang, Qingkai Fang, Zhe Yang, Yang Feng Title: LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token Arxiv: http://arxiv.org/abs/2501.03895v1 Abstract: The advent of real-time large multimodal models (LMMs) like GPT-4o has sparked considerable interest in efficient LMMs. LMM frameworks typically encode visual inputs into vision tokens (continuous representations) and integrate them and textual instructions into ...

Jan 09, 2025•22 min•Ep. 353

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

🤗 Upvotes: 18 | cs.CV Authors: Haobo Yuan, Xiangtai Li, Tao Zhang, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, Ming-Hsuan Yang Title: Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos Arxiv: http://arxiv.org/abs/2501.04001v1 Abstract: This work presents Sa2VA, the first unified model for dense grounded understanding of both images and videos. Unlike existing multi-modal large language models, which are often limited to specific moda...

Jan 09, 2025•23 min•Ep. 352

