Daily Paper Cast

Jingwen Liang, Gengyu Wang • dailypapercast.transistor.fm
We publish 10 episodes every day to discuss 10 AI research papers. Both the podcast scripts and audio are generated by AI. The 10 papers are selected from the highest-voted ones on Huggingface Daily Paper (https://huggingface.co/papers). Feedback and suggestions are welcome! Email us: dailypapercast.ai@gmail.com

Creator:
Jingwen Liang, 3D ML, https://www.linkedin.com/in/jingwen-liang/
Gengyu Wang, NLP, http://wanggengyu.com

Listen on:
Spotify: https://open.spotify.com/show/21nrhmdaA8qoBiH8q03NXL
Apple Podcast: https://podcasts.apple.com/us/podcast/daily-paper-cast/id1777620236

Cover Image by Kawen Kuang https://kawen.art

Episodes

Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

🤗 Upvotes: 99 | cs.CL, cs.AI, cs.LG Authors: Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, Yuqiong Liu, An Yang, Andrew Zhao, Yang Yue, Shiji Song, Bowen Yu, Gao Huang, Junyang Lin Title: Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning Arxiv: http://arxiv.org/abs/2506.01939v1 Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerf...

Jun 04, 2025 • 22 min • Ep. 861

REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards

🤗 Upvotes: 52 | cs.LG, cs.AI, cs.CL Authors: Zafir Stojanovski, Oliver Stanley, Joe Sharratt, Richard Jones, Abdulhakeem Adefioye, Jean Kaddour, Andreas Köpf Title: REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards Arxiv: http://arxiv.org/abs/2505.24760v1 Abstract: We introduce Reasoning Gym (RG), a library of reasoning environments for reinforcement learning with verifiable rewards. It provides over 100 data generators and verifiers spanning multiple doma...

Jun 04, 2025 • 22 min • Ep. 860

SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

🤗 Upvotes: 48 | cs.LG, cs.RO Authors: Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, Simon Alibert, Matthieu Cord, Thomas Wolf, Remi Cadene Title: SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics Arxiv: http://arxiv.org/abs/2506.01844v1 Abstract: Vision-language models (VLMs) pretrained on large-scale multimodal datasets encode rich visual and ling...

Jun 04, 2025 • 21 min • Ep. 859

Taming LLMs by Scaling Learning Rates with Gradient Grouping

🤗 Upvotes: 33 | cs.LG, cs.AI Authors: Siyuan Li, Juanxi Tian, Zedong Wang, Xin Jin, Zicheng Liu, Wentao Zhang, Dan Xu Title: Taming LLMs by Scaling Learning Rates with Gradient Grouping Arxiv: http://arxiv.org/abs/2506.01049v1 Abstract: Training large language models (LLMs) poses challenges due to their massive scale and heterogeneous architectures. While adaptive optimizers like AdamW help address gradient variations, they still struggle with efficient and effective parameter-wise learning rat...

Jun 04, 2025 • 20 min • Ep. 858

ARIA: Training Language Agents with Intention-Driven Reward Aggregation

🤗 Upvotes: 26 | cs.CL Authors: Ruihan Yang, Yikai Zhang, Aili Chen, Xintao Wang, Siyu Yuan, Jiangjie Chen, Deqing Yang, Yanghua Xiao Title: ARIA: Training Language Agents with Intention-Driven Reward Aggregation Arxiv: http://arxiv.org/abs/2506.00539v1 Abstract: Large language models (LLMs) have enabled agents to perform complex reasoning and decision-making through free-form language interactions. However, in open-ended language action environments (e.g., negotiation or question-asking games),...

Jun 04, 2025 • 24 min • Ep. 857

Temporal In-Context Fine-Tuning for Versatile Control of Video Diffusion Models

🤗 Upvotes: 24 | cs.CV Authors: Kinam Kim, Junha Hyung, Jaegul Choo Title: Temporal In-Context Fine-Tuning for Versatile Control of Video Diffusion Models Arxiv: http://arxiv.org/abs/2506.00996v1 Abstract: Recent advances in text-to-video diffusion models have enabled high-quality video synthesis, but controllable generation remains challenging, particularly under limited data and compute. Existing fine-tuning methods for conditional generation often rely on external encoders or architectural mo...

Jun 04, 2025 • 20 min • Ep. 856

LoHoVLA: A Unified Vision-Language-Action Model for Long-Horizon Embodied Tasks

🤗 Upvotes: 24 | cs.RO, cs.AI Authors: Yi Yang, Jiaxuan Sun, Siqi Kou, Yihan Wang, Zhijie Deng Title: LoHoVLA: A Unified Vision-Language-Action Model for Long-Horizon Embodied Tasks Arxiv: http://arxiv.org/abs/2506.00411v1 Abstract: Real-world embodied agents face long-horizon tasks, characterized by high-level goals demanding multi-step solutions beyond single actions. Successfully navigating these requires both high-level task planning (i.e., decomposing goals into sub-tasks) and low-level mot...

Jun 04, 2025 • 19 min • Ep. 855

Jigsaw-R1: A Study of Rule-based Visual Reinforcement Learning with Jigsaw Puzzles

🤗 Upvotes: 24 | cs.CV, cs.AI, cs.CL Authors: Zifu Wang, Junyi Zhu, Bo Tang, Zhiyu Li, Feiyu Xiong, Jiaqian Yu, Matthew B. Blaschko Title: Jigsaw-R1: A Study of Rule-based Visual Reinforcement Learning with Jigsaw Puzzles Arxiv: http://arxiv.org/abs/2505.23590v2 Abstract: The application of rule-based reinforcement learning (RL) to multimodal large language models (MLLMs) introduces unique challenges and potential deviations from findings in text-only domains, particularly for perception-heavy t...

Jun 04, 2025 • 25 min • Ep. 854

ShapeLLM-Omni: A Native Multimodal LLM for 3D Generation and Understanding

🤗 Upvotes: 23 | cs.CV Authors: Junliang Ye, Zhengyi Wang, Ruowen Zhao, Shenghao Xie, Jun Zhu Title: ShapeLLM-Omni: A Native Multimodal LLM for 3D Generation and Understanding Arxiv: http://arxiv.org/abs/2506.01853v1 Abstract: Recently, the powerful text-to-image capabilities of ChatGPT-4o have led to growing appreciation for native multimodal large language models. However, its multimodal capabilities remain confined to images and text. Yet beyond images, the ability to understand and generate ...

Jun 04, 2025 • 22 min • Ep. 853

SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning

🤗 Upvotes: 21 | cs.CL Authors: Zhongwei Wan, Zhihao Dou, Che Liu, Yu Zhang, Dongfei Cui, Qinjian Zhao, Hui Shen, Jing Xiong, Yi Xin, Yifan Jiang, Yangfan He, Mi Zhang, Shen Yan Title: SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning Arxiv: http://arxiv.org/abs/2506.01713v1 Abstract: Multimodal large language models (MLLMs) have shown promising capabilities in reasoning tasks, yet still struggle with complex problems requiring explicit self-reflection and self...

Jun 04, 2025 • 21 min • Ep. 852

ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

🤗 Upvotes: 83 | cs.CL, cs.AI Authors: Mingjie Liu, Shizhe Diao, Ximing Lu, Jian Hu, Xin Dong, Yejin Choi, Jan Kautz, Yi Dong Title: ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models Arxiv: http://arxiv.org/abs/2505.24864v1 Abstract: Recent advances in reasoning-centric language models have highlighted reinforcement learning (RL) as a promising method for aligning models with verifiable rewards. However, it remains contentious whether RL truly expands ...

Jun 03, 2025 • 21 min • Ep. 851

AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time

🤗 Upvotes: 63 | cs.CL Authors: Junyu Zhang, Runpei Dong, Han Wang, Xuying Ning, Haoran Geng, Peihao Li, Xialin He, Yutong Bai, Jitendra Malik, Saurabh Gupta, Huan Zhang Title: AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time Arxiv: http://arxiv.org/abs/2505.24863v1 Abstract: This paper presents AlphaOne ($\alpha$1), a universal framework for modulating reasoning progress in large reasoning models (LRMs) at test time. $\alpha$1 first introduces $\alpha$ moment, which represents the...

Jun 03, 2025 • 21 min • Ep. 850

Time Blindness: Why Video-Language Models Can't See What Humans Can?

🤗 Upvotes: 59 | cs.CV, cs.AI Authors: Ujjwal Upadhyay, Mukul Ranjan, Zhiqiang Shen, Mohamed Elhoseiny Title: Time Blindness: Why Video-Language Models Can't See What Humans Can? Arxiv: http://arxiv.org/abs/2505.24867v1 Abstract: Recent advances in vision-language models (VLMs) have made impressive strides in understanding spatio-temporal relationships in videos. However, when spatial information is obscured, these models struggle to capture purely temporal patterns. We introduce $\textbf{Spooky...

Jun 03, 2025 • 22 min • Ep. 849

HardTests: Synthesizing High-Quality Test Cases for LLM Coding

🤗 Upvotes: 37 | cs.CL Authors: Zhongmou He, Yee Man Choi, Kexun Zhang, Jiabao Ji, Junting Zhou, Dejia Xu, Ivan Bercovich, Aidan Zhang, Lei Li Title: HardTests: Synthesizing High-Quality Test Cases for LLM Coding Arxiv: http://arxiv.org/abs/2505.24098v1 Abstract: Verifiers play a crucial role in large language model (LLM) reasoning, needed by post-training techniques such as reinforcement learning. However, reliable verifiers are hard to get for difficult coding problems, because a well-disguise...

Jun 03, 2025 • 22 min • Ep. 848

Large Language Models for Data Synthesis

🤗 Upvotes: 36 | cs.LG Authors: Yihong Tang, Menglin Kong, Lijun Sun Title: Large Language Models for Data Synthesis Arxiv: http://arxiv.org/abs/2505.14752v1 Abstract: Generating synthetic data that faithfully captures the statistical structure of real-world distributions is a fundamental challenge in data modeling. Classical approaches often depend on strong parametric assumptions or manual structural design and struggle in high-dimensional or heterogeneous domains. Recent progress in Large Lan...

Jun 03, 2025 • 23 min • Ep. 847

Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation

🤗 Upvotes: 29 | cs.CL, cs.CV Authors: Jiwan Chung, Junhyeok Kim, Siyeol Kim, Jaeyoung Lee, Min Soo Kim, Youngjae Yu Title: Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation Arxiv: http://arxiv.org/abs/2505.18842v1 Abstract: We present v1, a lightweight extension to Multimodal Large Language Models (MLLMs) that enables selective visual revisitation during inference. While current MLLMs typically consume visual input only once and reason purely over...

Jun 03, 2025 • 22 min • Ep. 846

ViStoryBench: Comprehensive Benchmark Suite for Story Visualization

🤗 Upvotes: 27 | cs.CV Authors: Cailin Zhuang, Ailin Huang, Wei Cheng, Jingwei Wu, Yaoqi Hu, Jiaqi Liao, Zhewei Huang, Hongyuan Wang, Xinyao Liao, Weiwei Cai, Hengyuan Xu, Xuanyang Zhang, Xianfang Zeng, Gang Yu, Chi Zhang Title: ViStoryBench: Comprehensive Benchmark Suite for Story Visualization Arxiv: http://arxiv.org/abs/2505.24862v1 Abstract: Story visualization, which aims to generate a sequence of visually coherent images aligning with a given narrative and reference images, has seen signif...

Jun 03, 2025 • 21 min • Ep. 845

DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models

🤗 Upvotes: 21 | cs.CV, cs.AI Authors: Chenbin Pan, Wenbin He, Zhengzhong Tu, Liu Ren Title: DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models Arxiv: http://arxiv.org/abs/2505.24025v1 Abstract: The recent explosive interest in the reasoning capabilities of large language models, such as DeepSeek-R1, has demonstrated remarkable success through reinforcement learning-based fine-tuning frameworks, exemplified by methods like Group Relative Policy Optimization (GRPO). However, ...

Jun 03, 2025 • 23 min • Ep. 844

Table-R1: Inference-Time Scaling for Table Reasoning

🤗 Upvotes: 66 | cs.CL Authors: Zheyuan Yang, Lyuhao Chen, Arman Cohan, Yilun Zhao Title: Table-R1: Inference-Time Scaling for Table Reasoning Arxiv: http://arxiv.org/abs/2505.23621v1 Abstract: In this work, we present the first study to explore inference-time scaling on table reasoning tasks. We develop and evaluate two post-training strategies to enable inference-time scaling: distillation from frontier model reasoning traces and reinforcement learning with verifiable rewards (RLVR). For disti...

May 31, 2025 • 21 min • Ep. 843

Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

🤗 Upvotes: 54 | cs.CV, cs.AI, cs.LG, I.2.6; I.2 Authors: Diankun Wu, Fangfu Liu, Yi-Hsin Hung, Yueqi Duan Title: Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence Arxiv: http://arxiv.org/abs/2505.23747v1 Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced performance on 2D visual tasks. However, improving their spatial intelligence remains a challenge. Existing 3D MLLMs always rely on additional 3D or 2.5D data to inc...

May 31, 2025 • 20 min • Ep. 842

VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos

🤗 Upvotes: 51 | cs.CV, cs.AI, cs.CL Authors: Tingyu Song, Tongyan Hu, Guo Gan, Yilun Zhao Title: VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos Arxiv: http://arxiv.org/abs/2505.23693v1 Abstract: MLLMs have been widely studied for video question answering recently. However, most existing assessments focus on natural videos, overlooking synthetic videos, such as AI-generated content (AIGC). Meanwhile, some works in video generation rely on MLLMs to evaluate the quality...

May 31, 2025 • 26 min • Ep. 841

The Climb Carves Wisdom Deeper Than the Summit: On the Noisy Rewards in Learning to Reason

🤗 Upvotes: 45 | cs.CL Authors: Ang Lv, Ruobing Xie, Xingwu Sun, Zhanhui Kang, Rui Yan Title: The Climb Carves Wisdom Deeper Than the Summit: On the Noisy Rewards in Learning to Reason Arxiv: http://arxiv.org/abs/2505.22653v1 Abstract: Recent studies on post-training large language models (LLMs) for reasoning through reinforcement learning (RL) typically focus on tasks that can be accurately verified and rewarded, such as solving math problems. In contrast, our research investigates the impact o...

May 31, 2025 • 22 min • Ep. 840

ZeroGUI: Automating Online GUI Learning at Zero Human Cost

🤗 Upvotes: 39 | cs.AI, cs.CL, cs.CV Authors: Chenyu Yang, Shiqian Su, Shi Liu, Xuan Dong, Yue Yu, Weijie Su, Xuehui Wang, Zhaoyang Liu, Jinguo Zhu, Hao Li, Wenhai Wang, Yu Qiao, Xizhou Zhu, Jifeng Dai Title: ZeroGUI: Automating Online GUI Learning at Zero Human Cost Arxiv: http://arxiv.org/abs/2505.23762v1 Abstract: The rapid advancement of large Vision-Language Models (VLMs) has propelled the development of pure-vision-based GUI Agents, capable of perceiving and operating Graphical User Interf...

May 31, 2025 • 19 min • Ep. 839

VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning?

🤗 Upvotes: 28 | cs.CV Authors: Yuanxin Liu, Kun Ouyang, Haoning Wu, Yi Liu, Lin Sui, Xinhao Li, Yan Zhong, Y. Charles, Xinyu Zhou, Xu Sun Title: VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning? Arxiv: http://arxiv.org/abs/2505.23359v1 Abstract: Recent studies have shown that long chain-of-thought (CoT) reasoning can significantly enhance the performance of large language models (LLMs) on complex tasks. However, this benefit is yet to be demonstrated in the domain of v...

May 31, 2025 • 22 min • Ep. 838

Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering

🤗 Upvotes: 21 | cs.CL, cs.AI, cs.SE Authors: Guangtao Zeng, Maohao Shen, Delin Chen, Zhenting Qi, Subhro Das, Dan Gutfreund, David Cox, Gregory Wornell, Wei Lu, Zhang-Wei Hong, Chuang Gan Title: Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering Arxiv: http://arxiv.org/abs/2505.23604v1 Abstract: Language models (LMs) perform well on standardized coding benchmarks but struggle with real-world software engineering tasks such as resolving GitHub issues in SWE-Benc...

May 31, 2025 • 22 min • Ep. 837

The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

🤗 Upvotes: 84 | cs.LG, cs.AI, cs.CL Authors: Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, Zhiyuan Liu, Hao Peng, Lei Bai, Wanli Ouyang, Yu Cheng, Bowen Zhou, Ning Ding Title: The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models Arxiv: http://arxiv.org/abs/2505.22617v1 Abstract: This paper aims to overcome a major obstacle in scaling RL for reasoning with LLMs, namely the collapse of policy entro...

May 30, 2025 • 22 min • Ep. 836

SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents

🤗 Upvotes: 63 | cs.SE, cs.CL Authors: Ibragim Badertdinov, Alexander Golubev, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Andrei Andriushchenko, Maria Trofimova, Daria Litvintseva, Boris Yangel Title: SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents Arxiv: http://arxiv.org/abs/2505.20411v1 Abstract: LLM-based agents have shown promising capabilities in a growing range of software engineering (SWE) tasks. However, advanci...

May 30, 2025 • 21 min • Ep. 835

R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing

🤗 Upvotes: 59 | cs.CL, cs.AI, cs.LG, cs.PF, I.2.7 Authors: Tianyu Fu, Yi Ge, Yichen You, Enshu Liu, Zhihang Yuan, Guohao Dai, Shengen Yan, Huazhong Yang, Yu Wang Title: R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing Arxiv: http://arxiv.org/abs/2505.21600v1 Abstract: Large Language Models (LLMs) achieve impressive reasoning capabilities at the cost of substantial inference overhead, posing substantial deployment challenges. Although distilled Small Lan...

May 30, 2025 • 23 min • Ep. 834

Skywork Open Reasoner 1 Technical Report

🤗 Upvotes: 45 | cs.LG, cs.AI, cs.CL Authors: Jujie He, Jiacai Liu, Chris Yuhao Liu, Rui Yan, Chaojie Wang, Peng Cheng, Xiaoyu Zhang, Fuxiang Zhang, Jiacheng Xu, Wei Shen, Siyuan Li, Liang Zeng, Tianwen Wei, Cheng Cheng, Bo An, Yang Liu, Yahui Zhou Title: Skywork Open Reasoner 1 Technical Report Arxiv: http://arxiv.org/abs/2505.22312v2 Abstract: The success of DeepSeek-R1 underscores the significant role of reinforcement learning (RL) in enhancing the reasoning capabilities of large language mod...

May 30, 2025 • 22 min • Ep. 833

Sherlock: Self-Correcting Reasoning in Vision-Language Models

🤗 Upvotes: 44 | cs.CV, cs.CL, cs.LG Authors: Yi Ding, Ruqi Zhang Title: Sherlock: Self-Correcting Reasoning in Vision-Language Models Arxiv: http://arxiv.org/abs/2505.22651v1 Abstract: Reasoning Vision-Language Models (VLMs) have shown promising performance on complex multimodal tasks. However, they still face significant challenges: they are highly sensitive to reasoning errors, require large volumes of annotated data or accurate verifiers, and struggle to generalize beyond specific domains. T...

May 30, 2025 • 21 min • Ep. 832