Daily Paper Cast

Jingwen Liang, Gengyu Wang • dailypapercast.transistor.fm
We publish 10 episodes every day to discuss 10 AI research papers. Both the podcast scripts and audio are generated by AI. The 10 papers are selected from the highest-voted ones on Huggingface Daily Paper (https://huggingface.co/papers). Feedback and suggestions are welcome! Email us: dailypapercast.ai@gmail.com
Creator:
Jingwen Liang, 3D ML, https://www.linkedin.com/in/jingwen-liang/
Gengyu Wang, NLP, http://wanggengyu.com
Listen on:
Spotify: https://open.spotify.com/show/21nrhmdaA8qoBiH8q03NXL
Apple Podcast: https://podcasts.apple.com/us/podcast/daily-paper-cast/id1777620236
Cover Image by Kawen Kuang https://kawen.art

Episodes

RiemannLoRA: A Unified Riemannian Framework for Ambiguity-Free LoRA Optimization

🤗 Upvotes: 23 | cs.LG, cs.CL, cs.NA, math.DG, math.NA, 68T07, 65F55, 53Z50 Authors: Vladimir Bogachev, Vladimir Aletov, Alexander Molozhavenko, Denis Bobkov, Vera Soboleva, Aibek Alanov, Maxim Rakhuba Title: RiemannLoRA: A Unified Riemannian Framework for Ambiguity-Free LoRA Optimization Arxiv: http://arxiv.org/abs/2507.12142v1 Abstract: Low-Rank Adaptation (LoRA) has become a widely adopted standard for parameter-efficient fine-tuning of large language models (LLMs), significantly reducing mem...

Jul 19, 2025 • 23 min • Ep. 981

Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs

🤗 Upvotes: 50 | cs.CL, cs.AI Authors: Yangning Li, Weizhi Zhang, Yuyao Yang, Wei-Chieh Huang, Yaozu Wu, Junyu Luo, Yuanchen Bei, Henry Peng Zou, Xiao Luo, Yusheng Zhao, Chunkit Chan, Yankai Chen, Zhongfen Deng, Yinghui Li, Hai-Tao Zheng, Dongyuan Li, Renhe Jiang, Ming Zhang, Yangqiu Song, Philip S. Yu Title: Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs Arxiv: http://arxiv.org/abs/2507.09477v2 Abstract: Retrieval-Augmented Generation (RAG) lifts the factuali...

Jul 18, 2025 • 21 min • Ep. 980

Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models

🤗 Upvotes: 32 | cs.CV Authors: Tiezheng Zhang, Yitong Li, Yu-cheng Chou, Jieneng Chen, Alan Yuille, Chen Wei, Junfei Xiao Title: Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models Arxiv: http://arxiv.org/abs/2507.07104v2 Abstract: Building state-of-the-art Vision-Language Models (VLMs) with strong captioning capabilities typically necessitates training on billions of high-quality image-text pairs, requiring millions of GPU hours. This paper introduces the...

Jul 17, 2025 • 20 min • Ep. 979

EXAONE 4.0: Unified Large Language Models Integrating Non-reasoning and Reasoning Modes

🤗 Upvotes: 24 | cs.CL, cs.AI Authors: LG AI Research: Kyunghoon Bae, Eunbi Choi, Kibong Choi, Stanley Jungkyu Choi, Yemuk Choi, Kyubeen Han, Seokhee Hong, Junwon Hwang, Taewan Hwang, Joonwon Jang, Hyojin Jeon, Kijeong Jeon, Gerrard Jeongwon Jo, Hyunjik Jo, Jiyeon Jung, Euisoon Kim, Hyosang Kim, Jihoon Kim, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Yongil Kim, Youchul Kim, Edward Hwayoung Lee, Gwangho Lee, Haeju Lee, Honglak Lee, Jinsik Lee, Kyungmin Lee, Sangha Park,...

Jul 17, 2025 • 19 min • Ep. 978

Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination

🤗 Upvotes: 44 | cs.LG, cs.AI, cs.CL Authors: Mingqi Wu, Zhihao Zhang, Qiaole Dong, Zhiheng Xi, Jun Zhao, Senjie Jin, Xiaoran Fan, Yuhao Zhou, Yanwei Fu, Qin Liu, Songyang Zhang, Qi Zhang Title: Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination Arxiv: http://arxiv.org/abs/2507.10532v1 Abstract: The reasoning capabilities of large language models (LLMs) have been a longstanding focus of research. Recent works have further enhanced these capabilities...

Jul 16, 2025 • 21 min • Ep. 977

SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation

🤗 Upvotes: 43 | cs.CV, eess.AS Authors: Youliang Zhang, Zhaoyang Li, Duomin Wang, Jiahe Zhang, Deyu Zhou, Zixin Yin, Xili Dai, Gang Yu, Xiu Li Title: SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation Arxiv: http://arxiv.org/abs/2507.09862v1 Abstract: The rapid development of large-scale models has catalyzed significant breakthroughs in the digital human domain. These advanced methodologies offer high-fidelity solutions for avatar driving and ...

Jul 16, 2025 • 20 min • Ep. 976

Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation

🤗 Upvotes: 31 | cs.CL, cs.LG Authors: Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, Se-Young Yun Title: Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation Arxiv: http://arxiv.org/abs/2507.10524v1 Abstract: Scaling language models unlocks impressive capabilities, but the accompanying computational and memory demands make both training and deployment expensive. Existin...

Jul 16, 2025 • 22 min • Ep. 975

EmbRACE-3K: Embodied Reasoning and Action in Complex Environments

🤗 Upvotes: 25 | cs.CV, cs.AI, cs.CL Authors: Mingxian Lin, Wei Huang, Yitang Li, Chengjie Jiang, Kui Wu, Fangwei Zhong, Shengju Qian, Xin Wang, Xiaojuan Qi Title: EmbRACE-3K: Embodied Reasoning and Action in Complex Environments Arxiv: http://arxiv.org/abs/2507.10548v1 Abstract: Recent advanced vision-language models (VLMs) have demonstrated strong performance on passive, offline image and video understanding tasks. However, their effectiveness in embodied settings, which require online interact...

Jul 16, 2025 • 22 min • Ep. 974

REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once

🤗 Upvotes: 22 | cs.CL Authors: Zhuoshi Pan, Qizhi Pei, Yu Li, Qiyao Sun, Zinan Tang, H. Vicky Zhao, Conghui He, Lijun Wu Title: REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once Arxiv: http://arxiv.org/abs/2507.10541v2 Abstract: Recent Large Reasoning Models (LRMs) have achieved remarkable progress on task-specific benchmarks, yet their evaluation methods remain constrained by isolated problem-solving paradigms. Existing benchmarks predominantly assess single-quest...

Jul 16, 2025 • 25 min • Ep. 973

Test-Time Scaling with Reflective Generative Model

🤗 Upvotes: 68 | cs.LG, cs.CL Authors: Zixiao Wang, Yuxin Wang, Xiaorui Wang, Mengting Xing, Jie Gao, Jianjun Xu, Guangcan Liu, Chenhui Jin, Zhuo Wang, Shengzhuo Zhang, Hongtao Xie Title: Test-Time Scaling with Reflective Generative Model Arxiv: http://arxiv.org/abs/2507.01951v2 Abstract: We introduce our first reflective generative model MetaStone-S1, which obtains OpenAI o3-mini's performance via the new Reflective Generative Form. The new form focuses on high-quality reasoning trajectory sele...

Jul 15, 2025 • 22 min • Ep. 972

Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning

🤗 Upvotes: 47 | cs.CV, cs.CL Authors: Yana Wei, Liang Zhao, Jianjian Sun, Kangheng Lin, Jisheng Yin, Jingcheng Hu, Yinmin Zhang, En Yu, Haoran Lv, Zejia Weng, Jia Wang, Chunrui Han, Yuang Peng, Qi Han, Zheng Ge, Xiangyu Zhang, Daxin Jiang, Vishal M. Patel Title: Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning Arxiv: http://arxiv.org/abs/2507.05255v1 Abstract: The remarkable reasoning capability of large language models (LLMs) stems from cognitive behaviors ...

Jul 15, 2025 • 21 min • Ep. 971

NeuralOS: Towards Simulating Operating Systems via Neural Generative Models

🤗 Upvotes: 45 | cs.CV, cs.AI, cs.CL, cs.HC, cs.LG Authors: Luke Rivard, Sun Sun, Hongyu Guo, Wenhu Chen, Yuntian Deng Title: NeuralOS: Towards Simulating Operating Systems via Neural Generative Models Arxiv: http://arxiv.org/abs/2507.08800v1 Abstract: We introduce NeuralOS, a neural framework that simulates graphical user interfaces (GUIs) of operating systems by directly predicting screen frames in response to user inputs such as mouse movements, clicks, and keyboard events. NeuralOS combines ...

Jul 15, 2025 • 21 min • Ep. 970

CLiFT: Compressive Light-Field Tokens for Compute-Efficient and Adaptive Neural Rendering

🤗 Upvotes: 43 | cs.CV Authors: Zhengqing Wang, Yuefan Wu, Jiacheng Chen, Fuyang Zhang, Yasutaka Furukawa Title: CLiFT: Compressive Light-Field Tokens for Compute-Efficient and Adaptive Neural Rendering Arxiv: http://arxiv.org/abs/2507.08776v2 Abstract: This paper proposes a neural rendering approach that represents a scene as "compressed light-field tokens (CLiFTs)", retaining rich appearance and geometric information of a scene. CLiFT enables compute-efficient rendering by compressed tokens, w...

Jul 15, 2025 • 21 min • Ep. 969

KV Cache Steering for Inducing Reasoning in Small Language Models

🤗 Upvotes: 26 | cs.CL, cs.AI Authors: Max Belitsky, Dawid J. Kopiczko, Michael Dorkenwald, M. Jehanzeb Mirza, Cees G. M. Snoek, Yuki M. Asano Title: KV Cache Steering for Inducing Reasoning in Small Language Models Arxiv: http://arxiv.org/abs/2507.08799v1 Abstract: We propose cache steering, a lightweight method for implicit steering of language models via a one-shot intervention applied directly to the key-value cache. To validate its effectiveness, we apply cache steering to induce chain-of-t...

Jul 15, 2025 • 23 min • Ep. 968

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

🤗 Upvotes: 24 | cs.CL, cs.AI Authors: Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, Asaf Aharoni, Nathan Lintz, Tiago Cardal Pais, Henrik Jacobsson, Idan Szpektor, Nan-Jiang Jiang, Krishna Haridasan, Ahmed Omran, Nikunj Saunshi, Dara Bahri, Gaurav Mishra, Eric Chu, Toby Boyd, Brad Hekman, Aaron Parisi, Chaoyi Zhang, Kornraphop Kawintiranon, Tania Bedrax-...

Jul 15, 2025 • 21 min • Ep. 967

Neural-Driven Image Editing

🤗 Upvotes: 22 | cs.CV Authors: Pengfei Zhou, Jie Xia, Xiaopeng Peng, Wangbo Zhao, Zilong Ye, Zekai Li, Suorong Yang, Jiadong Pan, Yuanxiang Chen, Ziqiao Wang, Kai Wang, Qian Zheng, Xiaojun Chang, Gang Pan, Shurong Dong, Kaipeng Zhang, Yang You Title: Neural-Driven Image Editing Arxiv: http://arxiv.org/abs/2507.05397v1 Abstract: Traditional image editing typically relies on manual prompting, making it labor-intensive and inaccessible to individuals with limited motor control or language abilitie...

Jul 15, 2025 • 21 min • Ep. 966

Scaling RL to Long Videos

🤗 Upvotes: 95 | cs.CV, cs.AI, cs.CL Authors: Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, Sifei Liu, Hongxu Yin, Yao Lu, Song Han Title: Scaling RL to Long Videos Arxiv: http://arxiv.org/abs/2507.07966v1 Abstract: We introduce a full-stack framework that scales up reasoning in vision-language models (VLMs) to long videos, leveraging reinforcement learning. We address the unique challenges of long video reasoning b...

Jul 12, 2025 • 23 min • Ep. 965

T-LoRA: Single Image Diffusion Model Customization Without Overfitting

🤗 Upvotes: 83 | cs.CV Authors: Vera Soboleva, Aibek Alanov, Andrey Kuznetsov, Konstantin Sobolev Title: T-LoRA: Single Image Diffusion Model Customization Without Overfitting Arxiv: http://arxiv.org/abs/2507.05964v1 Abstract: While diffusion model fine-tuning offers a powerful approach for customizing pre-trained models to generate specific objects, it frequently suffers from overfitting when training samples are limited, compromising both generalization capability and output diversity. This pa...

Jul 12, 2025 • 23 min • Ep. 964

Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology

🤗 Upvotes: 37 | cs.CV, cs.AI, cs.CL Authors: Haochen Wang, Xiangtai Li, Zilong Huang, Anran Wang, Jiacong Wang, Tao Zhang, Jiani Zheng, Sule Bai, Zijian Kang, Jiashi Feng, Zhuochen Wang, Zhaoxiang Zhang Title: Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology Arxiv: http://arxiv.org/abs/2507.07999v1 Abstract: Models like OpenAI-o3 pioneer visual grounded reasoning by dynamically referencing visual regions, just like human "thinking with images". However, no bench...

Jul 12, 2025 • 20 min • Ep. 963

OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding

🤗 Upvotes: 29 | cs.CV Authors: JingLi Lin, Chenming Zhu, Runsen Xu, Xiaohan Mao, Xihui Liu, Tai Wang, Jiangmiao Pang Title: OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding Arxiv: http://arxiv.org/abs/2507.07984v1 Abstract: Recent advances in multimodal large language models (MLLMs) have shown remarkable capabilities in integrating vision and language for complex reasoning. While most existing benchmarks evaluate models under offline settings with a ...

Jul 12, 2025 • 23 min • Ep. 962

Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs

🤗 Upvotes: 24 | cs.CV, cs.AI Authors: Jeongseok Hyun, Sukjun Hwang, Su Ho Han, Taeoh Kim, Inwoong Lee, Dongyoon Wee, Joon-Young Lee, Seon Joo Kim, Minho Shim Title: Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs Arxiv: http://arxiv.org/abs/2507.07990v1 Abstract: Video large language models (LLMs) achieve strong video understanding by leveraging a large number of spatio-temporal tokens, but suffer from quadratic computational scaling with token count. T...

Jul 12, 2025 • 23 min • Ep. 961

Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling

🤗 Upvotes: 23 | cs.CV, cs.AI Authors: Haoyu Wu, Diankun Wu, Tianyu He, Junliang Guo, Yang Ye, Yueqi Duan, Jiang Bian Title: Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling Arxiv: http://arxiv.org/abs/2507.07982v1 Abstract: Videos inherently represent 2D projections of a dynamic 3D world. However, our analysis suggests that video diffusion models trained solely on raw video data often fail to capture meaningful geometric-aware structure in their lea...

Jul 12, 2025 • 21 min • Ep. 960

PyVision: Agentic Vision with Dynamic Tooling

🤗 Upvotes: 22 | cs.CL, cs.AI, cs.CV Authors: Shitian Zhao, Haoquan Zhang, Shaoheng Lin, Ming Li, Qilong Wu, Kaipeng Zhang, Chen Wei Title: PyVision: Agentic Vision with Dynamic Tooling Arxiv: http://arxiv.org/abs/2507.07998v1 Abstract: LLMs are increasingly deployed as agents, systems capable of planning, reasoning, and dynamically calling external tools. However, in visual reasoning, prior approaches largely remain limited by predefined workflows and static toolsets. In this report, we present...

Jul 12, 2025 • 19 min • Ep. 959

4KAgent: Agentic Any Image to 4K Super-Resolution

🤗 Upvotes: 56 | cs.CV, eess.IV Authors: Yushen Zuo, Qi Zheng, Mingyang Wu, Xinrui Jiang, Renjie Li, Jian Wang, Yide Zhang, Gengchen Mai, Lihong V. Wang, James Zou, Xiaoyu Wang, Ming-Hsuan Yang, Zhengzhong Tu Title: 4KAgent: Agentic Any Image to 4K Super-Resolution Arxiv: http://arxiv.org/abs/2507.07105v1 Abstract: We present 4KAgent, a unified agentic super-resolution generalist system designed to universally upscale any image to 4K resolution (and even higher, if applied iteratively). Our syst...

Jul 11, 2025 • 27 min • Ep. 958

Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data

🤗 Upvotes: 41 | cs.CV Authors: Ke Fan, Shunlin Lu, Minyue Dai, Runyi Yu, Lixing Xiao, Zhiyang Dou, Junting Dong, Lizhuang Ma, Jingbo Wang Title: Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data Arxiv: http://arxiv.org/abs/2507.07095v1 Abstract: Generating diverse and natural human motion sequences based on textual descriptions constitutes a fundamental and challenging research area within the domains of computer vision, graphics, and robotics. Despite significant advancem...

Jul 11, 2025 • 17 min • Ep. 957

Perception-Aware Policy Optimization for Multimodal Reasoning

🤗 Upvotes: 34 | cs.CL Authors: Zhenhailong Wang, Xuehang Guo, Sofia Stoica, Haiyang Xu, Hongru Wang, Hyeonjeong Ha, Xiusi Chen, Yangyi Chen, Ming Yan, Fei Huang, Heng Ji Title: Perception-Aware Policy Optimization for Multimodal Reasoning Arxiv: http://arxiv.org/abs/2507.06448v1 Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for endowing Large Language Models (LLMs) with robust multi-step reasoning abilities. However, its design and ...

Jul 11, 2025 • 23 min • Ep. 956

MIRIX: Multi-Agent Memory System for LLM-Based Agents

🤗 Upvotes: 33 | cs.CL, cs.AI Authors: Yu Wang, Xi Chen Title: MIRIX: Multi-Agent Memory System for LLM-Based Agents Arxiv: http://arxiv.org/abs/2507.07957v1 Abstract: Although memory capabilities of AI agents are gaining increasing attention, existing solutions remain fundamentally limited. Most rely on flat, narrowly scoped memory components, constraining their ability to personalize, abstract, and reliably recall user-specific information over time. To this end, we introduce MIRIX, a modular,...

Jul 11, 2025 • 22 min • Ep. 955

Rethinking Verification for LLM Code Generation: From Generation to Testing

🤗 Upvotes: 23 | cs.CL Authors: Zihan Ma, Taolin Zhang, Maosong Cao, Junnan Liu, Wenwei Zhang, Minnan Luo, Songyang Zhang, Kai Chen Title: Rethinking Verification for LLM Code Generation: From Generation to Testing Arxiv: http://arxiv.org/abs/2507.06920v2 Abstract: Large language models (LLMs) have recently achieved notable success in code-generation benchmarks such as HumanEval and LiveCodeBench. However, a detailed examination reveals that these evaluation suites often comprise only a limited ...

Jul 11, 2025 • 22 min • Ep. 954

SingLoRA: Low Rank Adaptation Using a Single Matrix

🤗 Upvotes: 68 | cs.AI Authors: David Bensaïd, Noam Rotstein, Roy Velich, Daniel Bensaïd, Ron Kimmel Title: SingLoRA: Low Rank Adaptation Using a Single Matrix Arxiv: http://arxiv.org/abs/2507.05566v1 Abstract: Low-Rank Adaptation (LoRA) has significantly advanced parameter-efficient fine-tuning of large pretrained models. LoRA augments the pre-trained weights of a model by adding the product of two smaller matrices that together form a low-rank matrix update. Recent research has shown that scal...

Jul 10, 2025 • 21 min • Ep. 953

A Survey on Latent Reasoning

🤗 Upvotes: 60 | cs.CL Authors: Rui-Jie Zhu, Tianhao Peng, Tianhao Cheng, Xingwei Qu, Jinfa Huang, Dawei Zhu, Hao Wang, Kaiwen Xue, Xuanliang Zhang, Yong Shan, Tianle Cai, Taylor Kergan, Assel Kembay, Andrew Smith, Chenghua Lin, Binh Nguyen, Yuqi Pan, Yuhong Chou, Zefan Cai, Zhenhe Wu, Yongchi Zhao, Tianyu Liu, Jian Yang, Wangchunshu Zhou, Chujie Zheng, Chongxuan Li, Yuyin Zhou, Zhoujun Li, Zhaoxiang Zhang, Jiaheng Liu, Ge Zhang, Wenhao Huang, Jason Eshraghian Title: A Survey on Latent Reasoning...

Jul 10, 2025 • 20 min • Ep. 952