Daily Paper Cast

Jingwen Liang, Gengyu Wang•dailypapercast.transistor.fm

We publish 10 episodes every day to discuss 10 AI research papers. Both the podcast scripts and audio are generated by AI. The 10 papers are selected from the highest-voted ones on Huggingface Daily Paper (https://huggingface.co/papers). Feedback and suggestions are welcome! Email us: dailypapercast.ai@gmail.com
Creator:
Jingwen Liang, 3D ML, https://www.linkedin.com/in/jingwen-liang/
Gengyu Wang, NLP, http://wanggengyu.com
Listen on:
Spotify: https://open.spotify.com/show/21nrhmdaA8qoBiH8q03NXL
Apple Podcast: https://podcasts.apple.com/us/podcast/daily-paper-cast/id1777620236
Cover Image by Kawen Kuang https://kawen.art

Last refreshed: July 27th, 2025 at 9:35 AM


Episodes

T1: Tool-integrated Self-verification for Test-time Compute Scaling in Small Language Models

🤗 Upvotes: 29 | cs.CL, cs.AI
Authors: Minki Kang, Jongwon Jeong, Jaewoong Cho
Title: T1: Tool-integrated Self-verification for Test-time Compute Scaling in Small Language Models
Arxiv: http://arxiv.org/abs/2504.04718v1
Abstract: Recent studies have demonstrated that test-time compute scaling effectively improves the performance of small language models (sLMs). However, prior research has mainly examined test-time compute scaling with an additional larger model as a verifier, leaving self-verifi...

Apr 09, 2025•21 min•Ep. 651

Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving

🤗 Upvotes: 30 | cs.SE, cs.AI, cs.CL
Authors: Daoguang Zan, Zhirong Huang, Wei Liu, Hanwu Chen, Linhao Zhang, Shulin Xin, Lu Chen, Qi Liu, Xiaojian Zhong, Aoyan Li, Siyao Liu, Yongsheng Xiao, Liangqiang Chen, Yuyu Zhang, Jing Su, Tianyu Liu, Rui Long, Kai Shen, Liang Xiang
Title: Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving
Arxiv: http://arxiv.org/abs/2504.02605v1
Abstract: The task of issue resolving is to modify a codebase to generate a patch that addresses a given issue. Howe...

Apr 08, 2025•26 min•Ep. 650

Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems

🤗 Upvotes: 98 | cs.AI
Authors: Bang Liu, Xinfeng Li, Jiayi Zhang, Jinlin Wang, Tanjin He, Sirui Hong, Hongzhang Liu, Shaokun Zhang, Kaitao Song, Kunlun Zhu, Yuheng Cheng, Suyuchen Wang, Xiaoqiang Wang, Yuyu Luo, Haibo Jin, Peiyan Zhang, Ollie Liu, Jiaqi Chen, Huan Zhang, Zhaoyang Yu, Haochen Shi, Boyan Li, Dekun Wu, Fengwei Teng, Xiaojun Jia, Jiawei Xu, Jinyu Xiang, Yizhang Lin, Tianming Liu, Tongliang Liu, Yu Su, Huan Sun, Glen Berseth, Jianyun Nie, Ian Foster, Logan Ward, Qingyun Wu, Yu Gu, M...

Apr 05, 2025•21 min•Ep. 649

Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing

🤗 Upvotes: 55 | cs.CV
Authors: Xiangyu Zhao, Peiyuan Zhang, Kexian Tang, Hao Li, Zicheng Zhang, Guangtao Zhai, Junchi Yan, Hua Yang, Xue Yang, Haodong Duan
Title: Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing
Arxiv: http://arxiv.org/abs/2504.02826v1
Abstract: Large Multi-modality Models (LMMs) have made significant progress in visual understanding and generation, but they still face challenges in General Visual Editing, particularly in following complex instructi...

Apr 05, 2025•23 min•Ep. 648

ZClip: Adaptive Spike Mitigation for LLM Pre-Training

🤗 Upvotes: 47 | cs.LG, cs.CL
Authors: Abhay Kumar, Louis Owen, Nilabhra Roy Chowdhury, Fabian Güra
Title: ZClip: Adaptive Spike Mitigation for LLM Pre-Training
Arxiv: http://arxiv.org/abs/2504.02507v1
Abstract: Training large language models (LLMs) presents numerous challenges, including gradient instability and loss spikes. These phenomena can lead to catastrophic divergence, requiring costly checkpoint restoration and data batch skipping. Traditional gradient clipping techniques, such as cons...

Apr 05, 2025•20 min•Ep. 647

GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image Generation

🤗 Upvotes: 34 | cs.CV
Authors: Zhiyuan Yan, Junyan Ye, Weijia Li, Zilong Huang, Shenghai Yuan, Xiangyang He, Kaiqing Lin, Jun He, Conghui He, Li Yuan
Title: GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image Generation
Arxiv: http://arxiv.org/abs/2504.02782v1
Abstract: The recent breakthroughs in OpenAI's GPT4o model have demonstrated surprisingly good capabilities in image generation and editing, resulting in significant excitement in the community. This technical report pres...

Apr 05, 2025•22 min•Ep. 646

Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme

🤗 Upvotes: 24 | cs.LG, cs.CL, cs.CV
Authors: Yan Ma, Steffi Chern, Xuyang Shen, Yiran Zhong, Pengfei Liu
Title: Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme
Arxiv: http://arxiv.org/abs/2504.02587v1
Abstract: Reinforcement learning (RL) has recently shown strong potential in improving the reasoning capabilities of large language models and is now being actively extended to vision-language models (VLMs). However, exist...

Apr 05, 2025•23 min•Ep. 645

WikiVideo: Article Generation from Multiple Videos

🤗 Upvotes: 24 | cs.CV, cs.CL
Authors: Alexander Martin, Reno Kriz, William Gantt Walden, Kate Sanders, Hannah Recknor, Eugene Yang, Francis Ferraro, Benjamin Van Durme
Title: WikiVideo: Article Generation from Multiple Videos
Arxiv: http://arxiv.org/abs/2504.00939v1
Abstract: We present the challenging task of automatically creating a high-level Wikipedia-style article that aggregates information from multiple diverse videos about real-world events, such as natural disasters or political electi...

Apr 05, 2025•22 min•Ep. 644

MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization

🤗 Upvotes: 57 | cs.CV, cs.AI
Authors: Siyuan Li, Luyuan Zhang, Zedong Wang, Juanxi Tian, Cheng Tan, Zicheng Liu, Chang Yu, Qingsong Xie, Haonan Lu, Haoqian Wang, Zhen Lei
Title: MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization
Arxiv: http://arxiv.org/abs/2504.00999v1
Abstract: Masked Image Modeling (MIM) with Vector Quantization (VQ) has achieved great success in both self-supervised pre-training and image generation. However...

Apr 04, 2025•20 min•Ep. 643

AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction

🤗 Upvotes: 30 | cs.CV
Authors: Junhao Cheng, Yuying Ge, Yixiao Ge, Jing Liao, Ying Shan
Title: AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction
Arxiv: http://arxiv.org/abs/2504.01014v1
Abstract: Recent advancements in image and video synthesis have opened up new promise in generative games. One particularly intriguing application is transforming characters from anime films into interactive, playable entities. This allows players to immerse themselves in the dynamic ani...

Apr 04, 2025•23 min•Ep. 642

Understanding R1-Zero-Like Training: A Critical Perspective

🤗 Upvotes: 25 | cs.LG, cs.AI, cs.CL
Authors: Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, Min Lin
Title: Understanding R1-Zero-Like Training: A Critical Perspective
Arxiv: http://arxiv.org/abs/2503.20783v1
Abstract: DeepSeek-R1-Zero has shown that reinforcement learning (RL) at scale can directly enhance the reasoning capabilities of LLMs without supervised fine-tuning. In this work, we critically examine R1-Zero-like training by analyzing its two core com...

Apr 04, 2025•20 min•Ep. 641

Towards Physically Plausible Video Generation via VLM Planning

🤗 Upvotes: 25 | cs.CV, cs.AI
Authors: Xindi Yang, Baolu Li, Yiming Zhang, Zhenfei Yin, Lei Bai, Liqian Ma, Zhiyong Wang, Jianfei Cai, Tien-Tsin Wong, Huchuan Lu, Xu Jia
Title: Towards Physically Plausible Video Generation via VLM Planning
Arxiv: http://arxiv.org/abs/2503.23368v2
Abstract: Video diffusion models (VDMs) have advanced significantly in recent years, enabling the generation of highly realistic videos and drawing the attention of the community in their potential as world simulators. ...

Apr 04, 2025•22 min•Ep. 640

DreamActor-M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance

🤗 Upvotes: 24 | cs.CV, cs.AI
Authors: Yuxuan Luo, Zhengkun Rong, Lizhen Wang, Longhao Zhang, Tianshu Hu, Yongming Zhu
Title: DreamActor-M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance
Arxiv: http://arxiv.org/abs/2504.01724v2
Abstract: While recent image-based human animation methods achieve realistic body and facial motion synthesis, critical gaps remain in fine-grained holistic controllability, multi-scale adaptability, and long-term temporal coherence, which lea...

Apr 04, 2025•21 min•Ep. 639

VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step

🤗 Upvotes: 22 | cs.CV
Authors: Hanyang Wang, Fangfu Liu, Jiawei Chi, Yueqi Duan
Title: VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step
Arxiv: http://arxiv.org/abs/2504.01956v2
Abstract: Recovering 3D scenes from sparse views is a challenging task due to its inherent ill-posed problem. Conventional methods have developed specialized solutions (e.g., geometry regularization or feed-forward deterministic model) to mitigate the issue. However, they still suffer from p...

Apr 04, 2025•22 min•Ep. 638

START: Self-taught Reasoner with Tools

🤗 Upvotes: 49 | cs.CL
Authors: Chengpeng Li, Mingfeng Xue, Zhenru Zhang, Jiaxi Yang, Beichen Zhang, Xiang Wang, Bowen Yu, Binyuan Hui, Junyang Lin, Dayiheng Liu
Title: START: Self-taught Reasoner with Tools
Arxiv: http://arxiv.org/abs/2503.04625v1
Abstract: Large reasoning models (LRMs) like OpenAI-o1 and DeepSeek-R1 have demonstrated remarkable capabilities in complex reasoning tasks through the utilization of long Chain-of-thought (CoT). However, these models often suffer from hallucinations ...

Mar 08, 2025•25 min•Ep. 637

Token-Efficient Long Video Understanding for Multimodal LLMs

🤗 Upvotes: 41 | cs.CV
Authors: Jindong Jiang, Xiuyu Li, Zhijian Liu, Muyang Li, Guo Chen, Zhiqi Li, De-An Huang, Guilin Liu, Zhiding Yu, Kurt Keutzer, Sungjin Ahn, Jan Kautz, Hongxu Yin, Yao Lu, Song Han, Wonmin Byeon
Title: Token-Efficient Long Video Understanding for Multimodal LLMs
Arxiv: http://arxiv.org/abs/2503.04130v1
Abstract: Recent advances in video-based multimodal large language models (Video-LLMs) have significantly improved video understanding by processing videos as sequences of ...

Mar 08, 2025•21 min•Ep. 636

LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM

🤗 Upvotes: 33 | cs.CL
Authors: Sambal Shikhar, Mohammed Irfan Kurpath, Sahal Shaji Mullappilly, Jean Lahoud, Fahad Khan, Rao Muhammad Anwer, Salman Khan, Hisham Cholakkal
Title: LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM
Arxiv: http://arxiv.org/abs/2503.04724v1
Abstract: Recent advancements in speech-to-speech dialogue systems leverage LLMs for multimodal interactions, yet they remain hindered by fine-tuning requirements, high computational overhead, and text-speech misal...

Mar 08, 2025•26 min•Ep. 635

EgoLife: Towards Egocentric Life Assistant

🤗 Upvotes: 21 | cs.CV
Authors: Jingkang Yang, Shuai Liu, Hongming Guo, Yuhao Dong, Xiamengwei Zhang, Sicheng Zhang, Pengyun Wang, Zitang Zhou, Binzhu Xie, Ziyue Wang, Bei Ouyang, Zhengyu Lin, Marco Cominelli, Zhongang Cai, Yuanhan Zhang, Peiyuan Zhang, Fangzhou Hong, Joerg Widmer, Francesco Gringoli, Lei Yang, Bo Li, Ziwei Liu
Title: EgoLife: Towards Egocentric Life Assistant
Arxiv: http://arxiv.org/abs/2503.03803v1
Abstract: We introduce EgoLife, a project to develop an egocentric life assista...

Mar 08, 2025•22 min•Ep. 634

Babel: Open Multilingual Large Language Models Serving Over 90% of Global Speakers

🤗 Upvotes: 42 | cs.CL, cs.AI
Authors: Yiran Zhao, Chaoqun Liu, Yue Deng, Jiahao Ying, Mahani Aljunied, Zhaodonghui Li, Lidong Bing, Hou Pong Chan, Yu Rong, Deli Zhao, Wenxuan Zhang
Title: Babel: Open Multilingual Large Language Models Serving Over 90% of Global Speakers
Arxiv: http://arxiv.org/abs/2503.00865v1
Abstract: Large language models (LLMs) have revolutionized natural language processing (NLP), yet open-source multilingual LLMs remain scarce, with existing models often limited in langua...

Mar 07, 2025•18 min•Ep. 633

HoT: Highlighted Chain of Thought for Referencing Supporting Facts from Inputs

🤗 Upvotes: 27 | cs.CL, cs.HC
Authors: Tin Nguyen, Logan Bolton, Mohammad Reza Taesiri, Anh Totti Nguyen
Title: HoT: Highlighted Chain of Thought for Referencing Supporting Facts from Inputs
Arxiv: http://arxiv.org/abs/2503.02003v2
Abstract: An Achilles heel of Large Language Models (LLMs) is their tendency to hallucinate non-factual statements. A response mixed of factual and non-factual statements poses a challenge for humans to verify and accurately base their decisions on. To combat this pro...

Mar 07, 2025•24 min•Ep. 632

Process-based Self-Rewarding Language Models

🤗 Upvotes: 27 | cs.CL, cs.AI
Authors: Shimao Zhang, Xiao Liu, Xin Zhang, Junxiao Liu, Zheheng Luo, Shujian Huang, Yeyun Gong
Title: Process-based Self-Rewarding Language Models
Arxiv: http://arxiv.org/abs/2503.03746v1
Abstract: Large Language Models have demonstrated outstanding performance across various downstream tasks and have been widely applied in multiple scenarios. Human-annotated preference data is used for training to further improve LLMs' performance, which is constrained by the uppe...

Mar 07, 2025•24 min•Ep. 631

Visual-RFT: Visual Reinforcement Fine-Tuning

🤗 Upvotes: 44 | cs.CV
Authors: Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, Jiaqi Wang
Title: Visual-RFT: Visual Reinforcement Fine-Tuning
Arxiv: http://arxiv.org/abs/2503.01785v1
Abstract: Reinforcement Fine-Tuning (RFT) in Large Reasoning Models like OpenAI o1 learns from feedback on its answers, which is especially useful in applications when fine-tuning data is scarce. Recent open-source work like DeepSeek-R1 demonstrates that reinforcement learning wit...

Mar 05, 2025•23 min•Ep. 630

Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

🤗 Upvotes: 42 | cs.CL, cs.AI, cs.LG
Authors: Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson, Hany Awadalla, Nguyen Bach, Jianmin Bao, Alon Benhaim, Martin Cai, Vishrav Chaudhary, Congcong Chen, Dong Chen, Dongdong Chen, Junkun Chen, Weizhu Chen, Yen-Chun Chen, Yi-ling Chen, Qi Dai, Xiyang Dai, Ruchao Fan, Mei Gao, Min Gao, Amit Garg, Abhishek Goswami, Junheng Hao, Amr Hendy, Yuxuan Hu, Xin Jin, Mahmoud Khademi, Dongwoo Kim, Young Jin Kim, Gina Lee, Jinyu Li, Yunsheng Li, Chen Liang, Xihui...

Mar 05, 2025•26 min•Ep. 629

Difix3D+: Improving 3D Reconstructions with Single-Step Diffusion Models

🤗 Upvotes: 30 | cs.CV
Authors: Jay Zhangjie Wu, Yuxuan Zhang, Haithem Turki, Xuanchi Ren, Jun Gao, Mike Zheng Shou, Sanja Fidler, Zan Gojcic, Huan Ling
Title: Difix3D+: Improving 3D Reconstructions with Single-Step Diffusion Models
Arxiv: http://arxiv.org/abs/2503.01774v1
Abstract: Neural Radiance Fields and 3D Gaussian Splatting have revolutionized 3D reconstruction and novel-view synthesis task. However, achieving photorealistic rendering from extreme novel viewpoints remains challenging, as ...

Mar 05, 2025•19 min•Ep. 628

DeepSolution: Boosting Complex Engineering Solution Design via Tree-based Exploration and Bi-point Thinking

🤗 Upvotes: 27 | cs.AI
Authors: Zhuoqun Li, Haiyang Yu, Xuanang Chen, Hongyu Lin, Yaojie Lu, Fei Huang, Xianpei Han, Yongbin Li, Le Sun
Title: DeepSolution: Boosting Complex Engineering Solution Design via Tree-based Exploration and Bi-point Thinking
Arxiv: http://arxiv.org/abs/2502.20730v1
Abstract: Designing solutions for complex engineering challenges is crucial in human production activities. However, previous research in the retrieval-augmented generation (RAG) field has not sufficiently ad...

Mar 04, 2025•23 min•Ep. 627

Chain of Draft: Thinking Faster by Writing Less

🤗 Upvotes: 27 | cs.CL, I.2.7
Authors: Silei Xu, Wenhao Xie, Lingxiao Zhao, Pengcheng He
Title: Chain of Draft: Thinking Faster by Writing Less
Arxiv: http://arxiv.org/abs/2502.18600v1
Abstract: Large Language Models (LLMs) have demonstrated remarkable performance in solving complex reasoning tasks through mechanisms like Chain-of-Thought (CoT) prompting, which emphasizes verbose, step-by-step reasoning. However, humans typically employ a more efficient strategy: drafting concise intermediate th...

Mar 04, 2025•23 min•Ep. 626

Multi-Turn Code Generation Through Single-Step Rewards

🤗 Upvotes: 21 | cs.LG, cs.AI, cs.CL
Authors: Arnav Kumar Jain, Gonzalo Gonzalez-Pumariega, Wayne Chen, Alexander M Rush, Wenting Zhao, Sanjiban Choudhury
Title: Multi-Turn Code Generation Through Single-Step Rewards
Arxiv: http://arxiv.org/abs/2502.20380v1
Abstract: We address the problem of code generation from multi-turn execution feedback. Existing methods either generate code without feedback or use complex, hierarchical reinforcement learning to optimize multi-turn rewards. We propose a si...

Mar 04, 2025•26 min•Ep. 625

Self-rewarding correction for mathematical reasoning

🤗 Upvotes: 51 | cs.AI, cs.LG
Authors: Wei Xiong, Hanning Zhang, Chenlu Ye, Lichang Chen, Nan Jiang, Tong Zhang
Title: Self-rewarding correction for mathematical reasoning
Arxiv: http://arxiv.org/abs/2502.19613v1
Abstract: We study self-rewarding reasoning large language models (LLMs), which can simultaneously generate step-by-step reasoning and evaluate the correctness of their outputs during the inference time, without external feedback. This integrated approach allows a single model to indepen...

Mar 01, 2025•25 min•Ep. 624

MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning

🤗 Upvotes: 44 | cs.CV, cs.AI
Authors: Jiazhen Pan, Che Liu, Junde Wu, Fenglin Liu, Jiayuan Zhu, Hongwei Bran Li, Chen Chen, Cheng Ouyang, Daniel Rueckert
Title: MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning
Arxiv: http://arxiv.org/abs/2502.19634v1
Abstract: Reasoning is a critical frontier for advancing medical image analysis, where transparency and trustworthiness play a central role in both clinician trust and regulatory appr...

Mar 01, 2025•23 min•Ep. 623

R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts

🤗 Upvotes: 33 | cs.LG
Authors: Zhongyang Li, Ziyue Li, Tianyi Zhou
Title: R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts
Arxiv: http://arxiv.org/abs/2502.20395v1
Abstract: In large multimodal models (LMMs), the perception of non-language modalities (e.g., visual representations) is usually not on par with the large language models (LLMs)' powerful reasoning capabilities, deterring LMMs' performance on challenging downstream tasks. This weakness has been recently mitigated by r...

Mar 01, 2025•22 min•Ep. 622
