Daily Paper Cast

Jingwen Liang, Gengyu Wang•dailypapercast.transistor.fm

We publish 10 episodes every day to discuss 10 AI research papers. Both the podcast scripts and audio are generated by AI. The 10 papers are selected from the highest-voted ones on Huggingface Daily Paper (https://huggingface.co/papers).

Feedback and suggestions are welcome! Email us: dailypapercast.ai@gmail.com

Creators:
Jingwen Liang, 3D ML, https://www.linkedin.com/in/jingwen-liang/
Gengyu Wang, NLP, http://wanggengyu.com

Listen on:
Spotify: https://open.spotify.com/show/21nrhmdaA8qoBiH8q03NXL
Apple Podcast: https://podcasts.apple.com/us/podcast/daily-paper-cast/id1777620236

Cover Image by Kawen Kuang https://kawen.art

Last refreshed: July 27th, 2025 at 9:35 AM

Episodes

$\nabla$NABLA: Neighborhood Adaptive Block-Level Attention

🤗 Upvotes: 81 | cs.CV Authors: Dmitrii Mikhailov, Aleksey Letunovskiy, Maria Kovaleva, Vladimir Arkhipkin, Vladimir Korviakov, Vladimir Polovnikov, Viacheslav Vasilev, Evelina Sidorova, Denis Dimitrov Title: $\nabla$NABLA: Neighborhood Adaptive Block-Level Attention Arxiv: http://arxiv.org/abs/2507.13546v1 Abstract: Recent progress in transformer-based architectures has demonstrated remarkable success in video generation tasks. However, the quadratic complexity of full attention mechanisms rema...

Jul 26, 2025•21 min•Ep. 1011

Group Sequence Policy Optimization

🤗 Upvotes: 57 | cs.LG, cs.AI, cs.CL Authors: Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, Junyang Lin Title: Group Sequence Policy Optimization Arxiv: http://arxiv.org/abs/2507.18071v1 Abstract: This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-lev...

Jul 26, 2025•23 min•Ep. 1010

MUR: Momentum Uncertainty guided Reasoning for Large Language Models

🤗 Upvotes: 31 | cs.CL Authors: Hang Yan, Fangzhi Xu, Rongman Xu, Yifei Li, Jian Zhang, Haoran Luo, Xiaobao Wu, Luu Anh Tuan, Haiteng Zhao, Qika Lin, Jun Liu Title: MUR: Momentum Uncertainty guided Reasoning for Large Language Models Arxiv: http://arxiv.org/abs/2507.14958v1 Abstract: Large Language Models (LLMs) have achieved impressive performance on reasoning-intensive tasks, yet optimizing their reasoning efficiency remains an open challenge. While Test-Time Scaling (TTS) improves reasoning q...

Jul 26, 2025•22 min•Ep. 1009

LAPO: Internalizing Reasoning Efficiency via Length-Adaptive Policy Optimization

🤗 Upvotes: 25 | cs.AI, cs.CL Authors: Xingyu Wu, Yuchen Yan, Shangke Lyu, Linjuan Wu, Yiwen Qiu, Yongliang Shen, Weiming Lu, Jian Shao, Jun Xiao, Yueting Zhuang Title: LAPO: Internalizing Reasoning Efficiency via Length-Adaptive Policy Optimization Arxiv: http://arxiv.org/abs/2507.15758v1 Abstract: Large reasoning models have achieved remarkable performance through extended chain-of-thought sequences, yet this computational freedom leads to excessive token generation even for simple problems. W...

Jul 26, 2025•20 min•Ep. 1008

Pixels, Patterns, but No Poetry: To See The World like Humans

🤗 Upvotes: 48 | cs.CV, cs.AI, cs.CL Authors: Hongcheng Gao, Zihao Huang, Lin Xu, Jingyi Tang, Xinhao Li, Yue Liu, Haoyang Li, Taihang Hu, Minhua Lin, Xinlong Yang, Ge Wu, Balong Bi, Hongyu Chen, Wentao Zhang Title: Pixels, Patterns, but No Poetry: To See The World like Humans Arxiv: http://arxiv.org/abs/2507.16863v1 Abstract: Achieving human-like perception and reasoning in Multimodal Large Language Models (MLLMs) remains a central challenge in artificial intelligence. While recent research has...

Jul 25, 2025•17 min•Ep. 1007

Yume: An Interactive World Generation Model

🤗 Upvotes: 45 | cs.CV, cs.AI, cs.HC Authors: Xiaofeng Mao, Shaoheng Lin, Zhen Li, Chuanhao Li, Wenshuo Peng, Tong He, Jiangmiao Pang, Mingmin Chi, Yu Qiao, Kaipeng Zhang Title: Yume: An Interactive World Generation Model Arxiv: http://arxiv.org/abs/2507.17744v1 Abstract: Yume aims to use images, text, or videos to create an interactive, realistic, and dynamic world, which allows exploration and control using peripheral devices or neural signals. In this report, we present a preview version of \...

Jul 25, 2025•26 min•Ep. 1006

DesignLab: Designing Slides Through Iterative Detection and Correction

🤗 Upvotes: 33 | cs.CV, cs.AI Authors: Jooyeol Yun, Heng Wang, Yotaro Shimose, Jaegul Choo, Shingo Takamatsu Title: DesignLab: Designing Slides Through Iterative Detection and Correction Arxiv: http://arxiv.org/abs/2507.17202v1 Abstract: Designing high-quality presentation slides can be challenging for non-experts due to the complexity involved in navigating various design choices. Numerous automated tools can suggest layouts and color schemes, yet often lack the ability to refine their own outp...

Jul 25, 2025•23 min•Ep. 1005

Can One Domain Help Others? A Data-Centric Study on Multi-Domain Reasoning via Reinforcement Learning

🤗 Upvotes: 26 | cs.AI, cs.LG Authors: Yu Li, Zhuoshi Pan, Honglin Lin, Mengyuan Sun, Conghui He, Lijun Wu Title: Can One Domain Help Others? A Data-Centric Study on Multi-Domain Reasoning via Reinforcement Learning Arxiv: http://arxiv.org/abs/2507.17512v1 Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of LLMs. Existing research has predominantly concentrated on isolated reasoning domains such as mathema...

Jul 25, 2025•19 min•Ep. 1004

Beyond Context Limits: Subconscious Threads for Long-Horizon Reasoning

🤗 Upvotes: 77 | cs.CL Authors: Hongyin Luo, Nathaniel Morgan, Tina Li, Derek Zhao, Ai Vy Ngo, Philip Schroeder, Lijie Yang, Assaf Ben-Kish, Jack O'Brien, James Glass Title: Beyond Context Limits: Subconscious Threads for Long-Horizon Reasoning Arxiv: http://arxiv.org/abs/2507.16784v1 Abstract: To break the context limits of large language models (LLMs) that bottleneck reasoning accuracy and efficiency, we propose the Thread Inference Model (TIM), a family of LLMs trained for recursive and decom...

Jul 24, 2025•21 min•Ep. 1003

Step-Audio 2 Technical Report

🤗 Upvotes: 42 | cs.CL, cs.SD, eess.AS Authors: Boyong Wu, Chao Yan, Chen Hu, Cheng Yi, Chengli Feng, Fei Tian, Feiyu Shen, Gang Yu, Haoyang Zhang, Jingbei Li, Mingrui Chen, Peng Liu, Wang You, Xiangyu Tony Zhang, Xingyuan Li, Xuerui Yang, Yayue Deng, Yechang Huang, Yuxin Li, Yuxin Zhang, Zhao You, Brian Li, Changyi Wan, Hanpeng Hu, Jiangjie Zhen, Siyu Chen, Song Yuan, Xuelin Zhang, Yimin Jiang, Yu Zhou, Yuxiang Yang, Bingxin Li, Buyun Ma, Changhe Song, Dongqing Pang, Guoqiang Hu, Haiyang Sun, K...

Jul 24, 2025•23 min•Ep. 1002

MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning

🤗 Upvotes: 37 | cs.CL, cs.AI, cs.LG Authors: Run-Ze Fan, Zengzhi Wang, Pengfei Liu Title: MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning Arxiv: http://arxiv.org/abs/2507.16812v1 Abstract: Scientific reasoning is critical for developing AI scientists and supporting human researchers in advancing the frontiers of natural science discovery. However, the open-source community has primarily focused on mathematics and coding while neglecting the scientific domain, ...

Jul 24, 2025•20 min•Ep. 1001

Upsample What Matters: Region-Adaptive Latent Sampling for Accelerated Diffusion Transformers

🤗 Upvotes: 27 | cs.CV, eess.IV Authors: Wongi Jeong, Kyungryeol Lee, Hoigi Seo, Se Young Chun Title: Upsample What Matters: Region-Adaptive Latent Sampling for Accelerated Diffusion Transformers Arxiv: http://arxiv.org/abs/2507.08422v1 Abstract: Diffusion transformers have emerged as an alternative to U-net-based diffusion models for high-fidelity image and video generation, offering superior scalability. However, their heavy computation remains a major obstacle to real-world deployment. Existi...

Jul 24, 2025•19 min•Ep. 1000

Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning

🤗 Upvotes: 23 | cs.CV, cs.CL, cs.LG Authors: Ang Li, Charles Wang, Kaiyu Yue, Zikui Cai, Ollie Liu, Deqing Fu, Peng Guo, Wang Bill Zhu, Vatsal Sharan, Robin Jia, Willie Neiswanger, Furong Huang, Tom Goldstein, Micah Goldblum Title: Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning Arxiv: http://arxiv.org/abs/2507.16746v1 Abstract: Humans often use visual aids, for example diagrams or sketches, when solving complex problems. Training multimodal models to do the same, known as Visual...

Jul 24, 2025•19 min•Ep. 999

GUI-G$^2$: Gaussian Reward Modeling for GUI Grounding

🤗 Upvotes: 98 | cs.LG, cs.AI, cs.CL, cs.CV, cs.HC Authors: Fei Tang, Zhangxuan Gu, Zhengxi Lu, Xuyang Liu, Shuheng Shen, Changhua Meng, Wen Wang, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang Title: GUI-G$^2$: Gaussian Reward Modeling for GUI Grounding Arxiv: http://arxiv.org/abs/2507.15846v2 Abstract: Graphical User Interface (GUI) grounding maps natural language instructions to precise interface locations for autonomous interaction. Current reinforcement learning approache...

Jul 23, 2025•25 min•Ep. 998

MiroMind-M1: An Open-Source Advancement in Mathematical Reasoning via Context-Aware Multi-Stage Policy Optimization

🤗 Upvotes: 93 | cs.CL Authors: Xingxuan Li, Yao Xiao, Dianwen Ng, Hai Ye, Yue Deng, Xiang Lin, Bin Wang, Zhanfeng Mo, Chong Zhang, Yueyi Zhang, Zonglin Yang, Ruilin Li, Lei Lei, Shihao Xu, Han Zhao, Weiling Chen, Feng Ji, Lidong Bing Title: MiroMind-M1: An Open-Source Advancement in Mathematical Reasoning via Context-Aware Multi-Stage Policy Optimization Arxiv: http://arxiv.org/abs/2507.14683v1 Abstract: Large language models have recently evolved from fluent text generation to advanced reasoni...

Jul 23, 2025•23 min•Ep. 997

The Invisible Leash: Why RLVR May Not Escape Its Origin

🤗 Upvotes: 63 | cs.LG, cs.AI, cs.CL Authors: Fang Wu, Weihao Xuan, Ximing Lu, Zaid Harchaoui, Yejin Choi Title: The Invisible Leash: Why RLVR May Not Escape Its Origin Arxiv: http://arxiv.org/abs/2507.14843v1 Abstract: Recent advances in large reasoning models highlight Reinforcement Learning with Verifiable Rewards (RLVR) as a promising method for enhancing AI's capabilities, particularly in solving complex logical tasks. However, it remains unclear whether RLVR truly expands a model's reasoni...

Jul 23, 2025•24 min•Ep. 996

NoHumansRequired: Autonomous High-Quality Image Editing Triplet Mining

🤗 Upvotes: 36 | cs.CV, cs.AI, cs.CL, cs.LG Authors: Maksim Kuprashevich, Grigorii Alekseenko, Irina Tolstykh, Georgii Fedorov, Bulat Suleimanov, Vladimir Dokholyan, Aleksandr Gordeev Title: NoHumansRequired: Autonomous High-Quality Image Editing Triplet Mining Arxiv: http://arxiv.org/abs/2507.14119v1 Abstract: Recent advances in generative modeling enable image editing assistants that follow natural language instructions without additional user input. Their supervised training requires millions...

Jul 23, 2025•23 min•Ep. 995

WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization

🤗 Upvotes: 31 | cs.CL, cs.AI Authors: Zhengwei Tao, Jialong Wu, Wenbiao Yin, Junkai Zhang, Baixuan Li, Haiyang Shen, Kuan Li, Liwen Zhang, Xinyu Wang, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou Title: WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization Arxiv: http://arxiv.org/abs/2507.15061v1 Abstract: The advent of Large Language Model (LLM)-powered agents has revolutionized artificial intelligence by enabling solutions to complex, open-ended tasks through web-...

Jul 23, 2025•21 min•Ep. 994

GR-3 Technical Report

🤗 Upvotes: 29 | cs.RO, cs.AI, cs.CV Authors: Chilam Cheang, Sijin Chen, Zhongren Cui, Yingdong Hu, Liqun Huang, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Xiao Ma, Hao Niu, Wenxuan Ou, Wanli Peng, Zeyu Ren, Haixin Shi, Jiawen Tian, Hongtao Wu, Xin Xiao, Yuyang Xiao, Jiafeng Xu, Yichu Yang Title: GR-3 Technical Report Arxiv: http://arxiv.org/abs/2507.15493v2 Abstract: We report our recent progress towards building generalist robot policies, the development of GR-3. GR-3 is a large-scale vision-la...

Jul 23, 2025•26 min•Ep. 993

Robust 3D-Masked Part-level Editing in 3D Gaussian Splatting with Regularized Score Distillation Sampling

🤗 Upvotes: 28 | cs.CV, cs.AI Authors: Hayeon Kim, Ji Ha Jang, Se Young Chun Title: Robust 3D-Masked Part-level Editing in 3D Gaussian Splatting with Regularized Score Distillation Sampling Arxiv: http://arxiv.org/abs/2507.11061v2 Abstract: Recent advances in 3D neural representations and instance-level editing models have enabled the efficient creation of high-quality 3D content. However, achieving precise local 3D edits remains challenging, especially for Gaussian Splatting, due to inconsisten...

Jul 23, 2025•22 min•Ep. 992

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction

🤗 Upvotes: 24 | cs.CV, cs.AI Authors: Zhixiong Zhang, Shuangrui Ding, Xiaoyi Dong, Songxin He, Jianfan Lin, Junsong Tang, Yuhang Zang, Yuhang Cao, Dahua Lin, Jiaqi Wang Title: SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction Arxiv: http://arxiv.org/abs/2507.15852v2 Abstract: Video Object Segmentation (VOS) is a core task in computer vision, requiring models to track and segment target objects across video frames. Despite notable advances with recent efforts,...

Jul 23, 2025•22 min•Ep. 991

Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos

🤗 Upvotes: 22 | cs.CV, cs.LG, cs.RO Authors: Hao Luo, Yicheng Feng, Wanpeng Zhang, Sipeng Zheng, Ye Wang, Haoqi Yuan, Jiazheng Liu, Chaoyi Xu, Qin Jin, Zongqing Lu Title: Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos Arxiv: http://arxiv.org/abs/2507.15597v1 Abstract: We introduce Being-H0, a dexterous Vision-Language-Action model (VLA) trained on large-scale human videos. Existing VLAs struggle with complex manipulation tasks requiring high dexterity and generalize ...

Jul 23, 2025•24 min•Ep. 990

The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs

🤗 Upvotes: 45 | cs.CL Authors: Zichen Wen, Jiashu Qu, Dongrui Liu, Zhiyuan Liu, Ruixi Wu, Yicun Yang, Xiangqi Jin, Haoyun Xu, Xuyang Liu, Weijia Li, Chaochao Lu, Jing Shao, Conghui He, Linfeng Zhang Title: The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs Arxiv: http://arxiv.org/abs/2507.11097v1 Abstract: Diffusion-based large language models (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs, offering faster inference and greater interact...

Jul 22, 2025•20 min•Ep. 989

A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models

🤗 Upvotes: 42 | cs.CL, cs.SD, eess.AS Authors: Kirill Borodin, Nikita Vasiliev, Vasiliy Kudryavtsev, Maxim Maslov, Mikhail Gorodnichev, Oleg Rogov, Grach Mkrtchian Title: A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models Arxiv: http://arxiv.org/abs/2507.13563v1 Abstract: Russian speech synthesis presents distinctive challenges, including vowel reduction, consonant devoicing, variable stress patterns, homograph ambiguity, and unnatural i...

Jul 22, 2025•20 min•Ep. 988

A Survey of Context Engineering for Large Language Models

🤗 Upvotes: 96 | cs.CL Authors: Lingrui Mei, Jiayu Yao, Yuyao Ge, Yiwei Wang, Baolong Bi, Yujun Cai, Jiazhi Liu, Mingyu Li, Zhong-Zhi Li, Duzhen Zhang, Chenlin Zhou, Jiayi Mao, Tianze Xia, Jiafeng Guo, Shenghua Liu Title: A Survey of Context Engineering for Large Language Models Arxiv: http://arxiv.org/abs/2507.13334v1 Abstract: The performance of Large Language Models (LLMs) is fundamentally determined by the contextual information provided during inference. This survey introduces Context Engin...

Jul 19, 2025•27 min•Ep. 987

VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning

🤗 Upvotes: 52 | cs.CV, cs.AI, cs.CL, cs.LG Authors: Senqiao Yang, Junyi Li, Xin Lai, Bei Yu, Hengshuang Zhao, Jiaya Jia Title: VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning Arxiv: http://arxiv.org/abs/2507.13348v1 Abstract: Recent advancements in vision-language models (VLMs) have improved performance by increasing the number of visual tokens, which are often significantly longer than text tokens. However, we observe that most real-world scenarios do not requ...

Jul 19, 2025•23 min•Ep. 986

$\pi^3$: Scalable Permutation-Equivariant Visual Geometry Learning

🤗 Upvotes: 36 | cs.CV Authors: Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, Tong He Title: $\pi^3$: Scalable Permutation-Equivariant Visual Geometry Learning Arxiv: http://arxiv.org/abs/2507.13347v1 Abstract: We introduce $\pi^3$, a feed-forward neural network that offers a novel approach to visual geometry reconstruction, breaking the reliance on a conventional fixed reference view. Previous methods often anchor their reconst...

Jul 19, 2025•20 min•Ep. 985

The Imitation Game: Turing Machine Imitator is Length Generalizable Reasoner

🤗 Upvotes: 33 | cs.CL Authors: Zhouqi Hua, Wenwei Zhang, Chengqi Lyu, Yuzhe Gu, Songyang Gao, Kuikun Liu, Kai Chen Title: The Imitation Game: Turing Machine Imitator is Length Generalizable Reasoner Arxiv: http://arxiv.org/abs/2507.13332v1 Abstract: Length generalization, the ability to solve problems of longer sequences than those observed during training, poses a core challenge of Transformer-based large language models (LLM). Although existing studies have predominantly focused on data-drive...

Jul 19, 2025•24 min•Ep. 984

AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning

🤗 Upvotes: 30 | cs.CV Authors: Yiming Ren, Zhiqiang Lin, Yu Li, Gao Meng, Weiyun Wang, Junjie Wang, Zicheng Lin, Jifeng Dai, Yujiu Yang, Wenhai Wang, Ruihang Chu Title: AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning Arxiv: http://arxiv.org/abs/2507.12841v1 Abstract: Controllable captioning is essential for precise multimodal alignment and instruction following, yet existing models often lack fine-grained control and reliable evaluation protoco...

Jul 19, 2025•23 min•Ep. 983

Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models

🤗 Upvotes: 29 | cs.CV Authors: Yudong Jin, Sida Peng, Xuan Wang, Tao Xie, Zhen Xu, Yifan Yang, Yujun Shen, Hujun Bao, Xiaowei Zhou Title: Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models Arxiv: http://arxiv.org/abs/2507.13344v1 Abstract: This paper addresses the challenge of high-fidelity view synthesis of humans with sparse-view videos as input. Previous methods solve the issue of insufficient observation by leveraging 4D diffusion mo...

Jul 19, 2025•23 min•Ep. 982

Hosted on Transistor
