Daily Paper Cast

Jingwen Liang, Gengyu Wang•dailypapercast.transistor.fm

We publish 10 episodes every day to discuss 10 AI research papers. Both the podcast scripts and audio are generated by AI. The 10 papers are selected from the highest-voted ones on Huggingface Daily Paper (https://huggingface.co/papers). Feedback and suggestions are welcome! Email us: dailypapercast.ai@gmail.com

Creators:
Jingwen Liang, 3D ML, https://www.linkedin.com/in/jingwen-liang/
Gengyu Wang, NLP, http://wanggengyu.com

Listen on:
Spotify: https://open.spotify.com/show/21nrhmdaA8qoBiH8q03NXL
Apple Podcast: https://podcasts.apple.com/us/podcast/daily-paper-cast/id1777620236

Cover Image by Kawen Kuang https://kawen.art

Last refreshed: July 27th, 2025 at 9:35 AM

Episodes

VMoBA: Mixture-of-Block Attention for Video Diffusion Models

🤗 Upvotes: 26 | cs.CV
Authors: Jianzong Wu, Liang Hou, Haotian Yang, Xin Tao, Ye Tian, Pengfei Wan, Di Zhang, Yunhai Tong
Title: VMoBA: Mixture-of-Block Attention for Video Diffusion Models
Arxiv: http://arxiv.org/abs/2506.23858v1
Abstract: The quadratic complexity of full attention mechanisms poses a significant bottleneck for Video Diffusion Models (VDMs) aiming to generate long-duration, high-resolution videos. While various sparse attention methods have been proposed, many are designed as t...

Jul 02, 2025•17 min•Ep. 921

Calligrapher: Freestyle Text Image Customization

🤗 Upvotes: 24 | cs.CV
Authors: Yue Ma, Qingyan Bai, Hao Ouyang, Ka Leong Cheng, Qiuyu Wang, Hongyu Liu, Zichen Liu, Haofan Wang, Jingye Chen, Yujun Shen, Qifeng Chen
Title: Calligrapher: Freestyle Text Image Customization
Arxiv: http://arxiv.org/abs/2506.24123v1
Abstract: We introduce Calligrapher, a novel diffusion-based framework that innovatively integrates advanced text customization with artistic typography for digital calligraphy and design applications. Addressing the challenges of preci...

Jul 02, 2025•23 min•Ep. 920

BlenderFusion: 3D-Grounded Visual Editing and Generative Compositing

🤗 Upvotes: 46 | cs.GR, cs.CV
Authors: Jiacheng Chen, Ramin Mehran, Xuhui Jia, Saining Xie, Sanghyun Woo
Title: BlenderFusion: 3D-Grounded Visual Editing and Generative Compositing
Arxiv: http://arxiv.org/abs/2506.17450v2
Abstract: We present BlenderFusion, a generative visual compositing framework that synthesizes new scenes by recomposing objects, camera, and background. It follows a layering-editing-compositing pipeline: (i) segmenting and converting visual inputs into editable 3D entities (l...

Jul 01, 2025•22 min•Ep. 919

LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs

🤗 Upvotes: 30 | cs.CV, cs.AI, cs.HC, cs.MM
Authors: Boyuan Sun, Jiaxing Zhao, Xihan Wei, Qibin Hou
Title: LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs
Arxiv: http://arxiv.org/abs/2506.21862v1
Abstract: In this paper, we present LLaVA-Scissor, a training-free token compression strategy designed for video multimodal large language models. Previous methods mostly attempt to compress tokens based on attention scores, but fail to effectively capture all semantic...

Jul 01, 2025•21 min•Ep. 918

XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation

🤗 Upvotes: 25 | cs.CV
Authors: Bowen Chen, Mengyi Zhao, Haomiao Sun, Li Chen, Xu Wang, Kang Du, Xinglong Wu
Title: XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation
Arxiv: http://arxiv.org/abs/2506.21416v1
Abstract: Achieving fine-grained control over subject identity and semantic attributes (pose, style, lighting) in text-to-image generation, particularly for multiple subjects, often undermines the editability and coherence of Diffusion Transformer...

Jul 01, 2025•24 min•Ep. 917

Feedback Friction: LLMs Struggle to Fully Incorporate External Feedback

🤗 Upvotes: 39 | cs.CL
Authors: Dongwei Jiang, Alvin Zhang, Andrew Wang, Nicholas Andrews, Daniel Khashabi
Title: Feedback Friction: LLMs Struggle to Fully Incorporate External Feedback
Arxiv: http://arxiv.org/abs/2506.11930v1
Abstract: Recent studies have shown LLMs possess some ability to improve their responses when given external feedback. However, it remains unclear how effectively and thoroughly these models can incorporate extrinsic feedback. In an ideal scenario, if LLMs receive near-per...

Jun 17, 2025•25 min•Ep. 916

Effective Red-Teaming of Policy-Adherent Agents

🤗 Upvotes: 33 | cs.MA, cs.AI, cs.CL, cs.CR
Authors: Itay Nakash, George Kour, Koren Lazar, Matan Vetzler, Guy Uziel, Ateret Anaby-Tavor
Title: Effective Red-Teaming of Policy-Adherent Agents
Arxiv: http://arxiv.org/abs/2506.09600v1
Abstract: Task-oriented LLM-based agents are increasingly used in domains with strict policies, such as refund eligibility or cancellation rules. The challenge lies in ensuring that the agent consistently adheres to these rules and policies, appropriately refusing an...

Jun 17, 2025•20 min•Ep. 915

Aligned Novel View Image and Geometry Synthesis via Cross-modal Attention Instillation

🤗 Upvotes: 29 | cs.CV
Authors: Min-Seop Kwak, Junho Kim, Sangdoo Yun, Dongyoon Han, Taekyoung Kim, Seungryong Kim, Jin-Hwa Kim
Title: Aligned Novel View Image and Geometry Synthesis via Cross-modal Attention Instillation
Arxiv: http://arxiv.org/abs/2506.11924v1
Abstract: We introduce a diffusion-based framework that performs aligned novel view image and geometry generation via a warping-and-inpainting methodology. Unlike prior methods that require dense posed images or pose-embedded generative ...

Jun 17, 2025•21 min•Ep. 914

ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning

🤗 Upvotes: 63 | cs.CL, cs.AI, cs.MA
Authors: Yu Sun, Xingyu Qian, Weiwen Xu, Hao Zhang, Chenghao Xiao, Long Li, Yu Rong, Wenbing Huang, Qifeng Bai, Tingyang Xu
Title: ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning
Arxiv: http://arxiv.org/abs/2506.09513v1
Abstract: Though reasoning-based large language models (LLMs) have excelled in mathematics and programming, their capabilities in knowledge-intensive medical question answering remain underexplored. To address t...

Jun 14, 2025•22 min•Ep. 913

SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks

🤗 Upvotes: 40 | cs.SE, cs.AI
Authors: Lianghong Guo, Yanlin Wang, Caihua Li, Pengyu Yang, Jiachi Chen, Wei Tao, Yingtian Zou, Duyu Tang, Zibin Zheng
Title: SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks
Arxiv: http://arxiv.org/abs/2506.10954v1
Abstract: Constructing large-scale datasets for the GitHub issue resolution task is crucial for both training and evaluating the software engineering capabilities of Large Language Models (LLMs). However, ...

Jun 14, 2025•22 min•Ep. 912

Text-Aware Image Restoration with Diffusion Models

🤗 Upvotes: 34 | cs.CV, cs.AI, cs.LG
Authors: Jaewon Min, Jin Hyeon Kim, Paul Hyunbin Cho, Jaeeun Lee, Jihye Park, Minkyu Park, Sangpil Kim, Hyunhee Park, Seungryong Kim
Title: Text-Aware Image Restoration with Diffusion Models
Arxiv: http://arxiv.org/abs/2506.09993v1
Abstract: Image restoration aims to recover degraded images. However, existing diffusion-based restoration methods, despite great success in natural image restoration, often struggle to faithfully reconstruct textual regions in deg...

Jun 14, 2025•24 min•Ep. 911

AniMaker: Automated Multi-Agent Animated Storytelling with MCTS-Driven Clip Generation

🤗 Upvotes: 30 | cs.MA, cs.CV
Authors: Haoyuan Shi, Yunxin Li, Xinyu Chen, Longyue Wang, Baotian Hu, Min Zhang
Title: AniMaker: Automated Multi-Agent Animated Storytelling with MCTS-Driven Clip Generation
Arxiv: http://arxiv.org/abs/2506.10540v1
Abstract: Despite rapid advancements in video generation models, generating coherent storytelling videos that span multiple scenes and characters remains challenging. Current methods often rigidly convert pre-generated keyframes into fixed-length clips, ...

Jun 14, 2025•20 min•Ep. 910

VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos

🤗 Upvotes: 29 | cs.CV, cs.AI, cs.MM
Authors: Jiashuo Yu, Yue Wu, Meng Chu, Zhifei Ren, Zizheng Huang, Pei Chu, Ruijie Zhang, Yinan He, Qirui Li, Songze Li, Zhenxiang Li, Zhongying Tu, Conghui He, Yu Qiao, Yali Wang, Yi Wang, Limin Wang
Title: VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos
Arxiv: http://arxiv.org/abs/2506.10857v1
Abstract: We present VRBench, the first long narrative video benchmark crafted for evaluating large models' multi-step reasoning capabilities, a...

Jun 14, 2025•22 min•Ep. 909

Discrete Audio Tokens: More Than a Survey!

🤗 Upvotes: 24 | cs.SD, cs.AI, cs.CL, eess.AS
Authors: Pooneh Mousavi, Gallil Maimon, Adel Moumen, Darius Petermann, Jiatong Shi, Haibin Wu, Haici Yang, Anastasia Kuznetsova, Artem Ploujnikov, Ricard Marxer, Bhuvana Ramabhadran, Benjamin Elizalde, Loren Lugosch, Jinyu Li, Cem Subakan, Phil Woodland, Minje Kim, Hung-yi Lee, Shinji Watanabe, Yossi Adi, Mirco Ravanelli
Title: Discrete Audio Tokens: More Than a Survey!
Arxiv: http://arxiv.org/abs/2506.10274v1
Abstract: Discrete audio tokens are comp...

Jun 14, 2025•25 min•Ep. 908

Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models

🤗 Upvotes: 76 | cs.CL, cs.LG
Authors: Pengyi Li, Matvey Skripkin, Alexander Zubrey, Andrey Kuznetsov, Ivan Oseledets
Title: Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models
Arxiv: http://arxiv.org/abs/2506.06395v3
Abstract: Large language models (LLMs) excel at reasoning, yet post-training remains critical for aligning their behavior with task goals. Existing reinforcement learning (RL) methods often depend on costly human annotations or external reward models. We propose ...

Jun 13, 2025•21 min•Ep. 907

Seedance 1.0: Exploring the Boundaries of Video Generation Models

🤗 Upvotes: 49 | cs.CV
Authors: Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, Xunsong Li, Yifu Li, Shanchuan Lin, Zhijie Lin, Jiawei Liu, Shu Liu, Xiaonan Nie, Zhiwu Qing, Yuxi Ren, Li Sun, Zhi Tian, Rui Wang, Sen Wang, Guoqiang Wei, Guohong Wu, Jie Wu, Ruiqi Xia, Fei Xiao, Xuefeng Xiao, Jiangqiao Yan, Ceyuan Yang, Jianchao Yang, Runkai Yang, Tao Yang, Yihang Yang, Zilyu Ye, Xuejiao Zeng, Yan Zeng, Heng Zhang, Yang Zhao, Xiao...

Jun 13, 2025•21 min•Ep. 906

Multiverse: Your Language Models Secretly Decide How to Parallelize and Merge Generation

🤗 Upvotes: 37 | cs.LG
Authors: Xinyu Yang, Yuwei An, Hongyi Liu, Tianqi Chen, Beidi Chen
Title: Multiverse: Your Language Models Secretly Decide How to Parallelize and Merge Generation
Arxiv: http://arxiv.org/abs/2506.09991v1
Abstract: Autoregressive Large Language Models (AR-LLMs) frequently exhibit implicit parallelism in sequential generation. Inspired by this, we introduce Multiverse, a new generative model that enables natively parallel generation. Multiverse internalizes a MapReduce parad...

Jun 13, 2025•21 min•Ep. 905

Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation

🤗 Upvotes: 36 | cs.CV, cs.AI, cs.LG
Authors: Shanchuan Lin, Ceyuan Yang, Hao He, Jianwen Jiang, Yuxi Ren, Xin Xia, Yang Zhao, Xuefeng Xiao, Lu Jiang
Title: Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation
Arxiv: http://arxiv.org/abs/2506.09350v1
Abstract: Existing large-scale video generation models are computationally intensive, preventing adoption in real-time and interactive applications. In this work, we propose autoregressive adversarial post-training (AA...

Jun 13, 2025•26 min•Ep. 904

ComfyUI-R1: Exploring Reasoning Models for Workflow Generation

🤗 Upvotes: 34 | cs.CL, cs.CV, cs.SE
Authors: Zhenran Xu, Yiyu Wang, Xue Yang, Longyue Wang, Weihua Luo, Kaifu Zhang, Baotian Hu, Min Zhang
Title: ComfyUI-R1: Exploring Reasoning Models for Workflow Generation
Arxiv: http://arxiv.org/abs/2506.09790v1
Abstract: AI-generated content has evolved from monolithic models to modular workflows, particularly on platforms like ComfyUI, enabling customization in creative pipelines. However, crafting effective workflows requires great expertise to orchestra...

Jun 13, 2025•23 min•Ep. 903

PlayerOne: Egocentric World Simulator

🤗 Upvotes: 26 | cs.CV
Authors: Yuanpeng Tu, Hao Luo, Xi Chen, Xiang Bai, Fan Wang, Hengshuang Zhao
Title: PlayerOne: Egocentric World Simulator
Arxiv: http://arxiv.org/abs/2506.09995v1
Abstract: We introduce PlayerOne, the first egocentric realistic world simulator, facilitating immersive and unrestricted exploration within vividly dynamic environments. Given an egocentric scene image from the user, PlayerOne can accurately construct the corresponding world and generate egocentric videos that a...

Jun 13, 2025•20 min•Ep. 902

Auto-Regressive vs Flow-Matching: a Comparative Study of Modeling Paradigms for Text-to-Music Generation

🤗 Upvotes: 26 | cs.SD, cs.AI, cs.LG, eess.AS
Authors: Or Tal, Felix Kreuk, Yossi Adi
Title: Auto-Regressive vs Flow-Matching: a Comparative Study of Modeling Paradigms for Text-to-Music Generation
Arxiv: http://arxiv.org/abs/2506.08570v2
Abstract: Recent progress in text-to-music generation has enabled models to synthesize high-quality musical segments, full compositions, and even respond to fine-grained control signals, e.g. chord progressions. State-of-the-art (SOTA) systems differ significan...

Jun 13, 2025•22 min•Ep. 901

Geopolitical biases in LLMs: what are the "good" and the "bad" countries according to contemporary language models

🤗 Upvotes: 54 | cs.CL
Authors: Mikhail Salnikov, Dmitrii Korzh, Ivan Lazichny, Elvir Karimov, Artyom Iudin, Ivan Oseledets, Oleg Y. Rogov, Alexander Panchenko, Natalia Loukachevitch, Elena Tutubalina
Title: Geopolitical biases in LLMs: what are the "good" and the "bad" countries according to contemporary language models
Arxiv: http://arxiv.org/abs/2506.06751v1
Abstract: This paper evaluates geopolitical biases in LLMs with respect to various countries through an analysis of their interpretation ...

Jun 12, 2025•21 min•Ep. 900

Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better

🤗 Upvotes: 26 | cs.CV, cs.AI, cs.CL
Authors: Dianyi Wang, Wei Song, Yikun Wang, Siyuan Wang, Kaicheng Yu, Zhongyu Wei, Jiaqi Wang
Title: Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better
Arxiv: http://arxiv.org/abs/2506.09040v1
Abstract: Typical large vision-language models (LVLMs) apply autoregressive supervision solely to textual sequences, without fully incorporating the visual modality into the learning process. This results in three key limitations: (1) an inabilit...

Jun 12, 2025•21 min•Ep. 899

RuleReasoner: Reinforced Rule-based Reasoning via Domain-aware Dynamic Sampling

🤗 Upvotes: 25 | cs.CL, cs.AI, cs.LG
Authors: Yang Liu, Jiaqi Li, Zilong Zheng
Title: RuleReasoner: Reinforced Rule-based Reasoning via Domain-aware Dynamic Sampling
Arxiv: http://arxiv.org/abs/2506.08672v1
Abstract: Rule-based reasoning has been acknowledged as one of the fundamental problems in reasoning, while deviations in rule formats, types, and complexity in real-world applications pose severe challenges. Recent studies have shown that large reasoning models (LRMs) have remarkable reasoni...

Jun 12, 2025•20 min•Ep. 898

Reinforcement Pre-Training

🤗 Upvotes: 150 | cs.CL
Authors: Qingxiu Dong, Li Dong, Yao Tang, Tianzhu Ye, Yutao Sun, Zhifang Sui, Furu Wei
Title: Reinforcement Pre-Training
Arxiv: http://arxiv.org/abs/2506.08007v1
Abstract: In this work, we introduce Reinforcement Pre-Training (RPT) as a new scaling paradigm for large language models and reinforcement learning (RL). Specifically, we reframe next-token prediction as a reasoning task trained using RL, where it receives verifiable rewards for correctly predicting the next tok...

Jun 11, 2025•20 min•Ep. 897

Saffron-1: Towards an Inference Scaling Paradigm for LLM Safety Assurance

🤗 Upvotes: 62 | cs.LG, cs.AI, cs.CR
Authors: Ruizhong Qiu, Gaotang Li, Tianxin Wei, Jingrui He, Hanghang Tong
Title: Saffron-1: Towards an Inference Scaling Paradigm for LLM Safety Assurance
Arxiv: http://arxiv.org/abs/2506.06444v1
Abstract: Existing safety assurance research has primarily focused on training-phase alignment to instill safe behaviors into LLMs. However, recent studies have exposed these methods' susceptibility to diverse jailbreak attacks. Concurrently, inference scaling has si...

Jun 11, 2025•21 min•Ep. 896

MiniCPM4: Ultra-Efficient LLMs on End Devices

🤗 Upvotes: 60 | cs.CL, cs.AI
Authors: MiniCPM Team, Chaojun Xiao, Yuxuan Li, Xu Han, Yuzhuo Bai, Jie Cai, Haotian Chen, Wentong Chen, Xin Cong, Ganqu Cui, Ning Ding, Shengdan Fan, Yewei Fang, Zixuan Fu, Wenyu Guan, Yitong Guan, Junshao Guo, Yufeng Han, Bingxiang He, Yuxiang Huang, Cunliang Kong, Qiuzuo Li, Siyuan Li, Wenhao Li, Yanghao Li, Yishan Li, Zhen Li, Dan Liu, Biyuan Lin, Yankai Lin, Xiang Long, Quanyu Lu, Yaxi Lu, Peiyan Luo, Hongya Lyu, Litu Ou, Yinxu Pan, Zekai Qu, Qundong Shi, Zijun...

Jun 11, 2025•20 min•Ep. 895

SpatialLM: Training Large Language Models for Structured Indoor Modeling

🤗 Upvotes: 31 | cs.CV
Authors: Yongsen Mao, Junhao Zhong, Chuan Fang, Jia Zheng, Rui Tang, Hao Zhu, Ping Tan, Zihan Zhou
Title: SpatialLM: Training Large Language Models for Structured Indoor Modeling
Arxiv: http://arxiv.org/abs/2506.07491v1
Abstract: SpatialLM is a large language model designed to process 3D point cloud data and generate structured 3D scene understanding outputs. These outputs include architectural elements like walls, doors, windows, and oriented object boxes with their seman...

Jun 11, 2025•22 min•Ep. 894

Image Reconstruction as a Tool for Feature Analysis

🤗 Upvotes: 27 | cs.CV, 68T10, 68T30, 68T45, I.2.10
Authors: Eduard Allakhverdov, Dmitrii Tarasov, Elizaveta Goncharova, Andrey Kuznetsov
Title: Image Reconstruction as a Tool for Feature Analysis
Arxiv: http://arxiv.org/abs/2506.07803v1
Abstract: Vision encoders are increasingly used in modern applications, from vision-only models to multimodal systems such as vision-language models. Despite their remarkable success, it remains unclear how these architectures represent features internally. Here...

Jun 11, 2025•22 min•Ep. 893

Astra: Toward General-Purpose Mobile Robots via Hierarchical Multimodal Learning

🤗 Upvotes: 25 | cs.RO, cs.AI
Authors: Sheng Chen, Peiyu He, Jiaxin Hu, Ziyang Liu, Yansheng Wang, Tao Xu, Chi Zhang, Chongchong Zhang, Chao An, Shiyu Cai, Duo Cao, Kangping Chen, Shuai Chu, Tianwei Chu, Mingdi Dan, Min Du, Weiwei Fang, Pengyou Fu, Junkai Hu, Xiaowei Jiang, Zhaodi Jiang, Fuxuan Li, Jun Li, Minghui Li, Mingyao Li, Yanchang Li, Zhibin Li, Guangming Liu, Kairui Liu, Lihao Liu, Weizhi Liu, Xiaoshun Liu, Yufei Liu, Yunfei Liu, Qiang Lu, Yuanfei Luo, Xiang Lv, Hongying Ma, Sai Ma, Lin...

Jun 11, 2025•22 min•Ep. 892
