Daily Paper Cast

Jingwen Liang, Gengyu Wang • dailypapercast.transistor.fm
We publish 10 episodes every day to discuss 10 AI research papers. Both the podcast scripts and audio are generated by AI. The 10 papers are selected from the highest-voted ones on Huggingface Daily Paper (https://huggingface.co/papers). Feedback and suggestions are welcome! Email us: dailypapercast.ai@gmail.com

Creator:
Jingwen Liang, 3D ML, https://www.linkedin.com/in/jingwen-liang/
Gengyu Wang, NLP, http://wanggengyu.com

Listen on:
Spotify: https://open.spotify.com/show/21nrhmdaA8qoBiH8q03NXL
Apple Podcast: https://podcasts.apple.com/us/podcast/daily-paper-cast/id1777620236

Cover Image by Kawen Kuang https://kawen.art

Episodes

Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning

🤗 Upvotes: 67 | cs.CV Authors: Yibin Wang, Zhimin Li, Yuhang Zang, Chunyu Wang, Qinglin Lu, Cheng Jin, Jiaqi Wang Title: Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning Arxiv: http://arxiv.org/abs/2505.03318v1 Abstract: Recent advances in multimodal Reward Models (RMs) have shown significant promise in delivering reward signals to align vision models with human preferences. However, current RMs are generally restricted to providing direct responses or engaging...

May 08, 2025 • 22 min • Ep. 741

Absolute Zero: Reinforced Self-play Reasoning with Zero Data

🤗 Upvotes: 63 | cs.LG, cs.AI, cs.CL Authors: Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Yang Yue, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, Gao Huang Title: Absolute Zero: Reinforced Self-play Reasoning with Zero Data Arxiv: http://arxiv.org/abs/2505.03335v2 Abstract: Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learning directly from outcome-based rewards. Recent RLVR works tha...

May 08, 2025 • 25 min • Ep. 740

RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale

🤗 Upvotes: 23 | cs.CL, cs.AI, cs.LG, I.2.7 Authors: Daniel Goldstein, Eric Alcaide, Janna Lu, Eugene Cheah Title: RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale Arxiv: http://arxiv.org/abs/2505.03005v1 Abstract: We present Rapid Attention Distillation to Linear Attention Decoders at Scale (RADLADS), a protocol for rapidly converting softmax attention transformers into linear attention decoder models, along with two new RWKV-variant architectures, and models converte...

May 08, 2025 • 21 min • Ep. 739

FlexiAct: Towards Flexible Action Control in Heterogeneous Scenarios

🤗 Upvotes: 21 | cs.CV, cs.AI, cs.MM Authors: Shiyi Zhang, Junhao Zhuang, Zhaoyang Zhang, Ying Shan, Yansong Tang Title: FlexiAct: Towards Flexible Action Control in Heterogeneous Scenarios Arxiv: http://arxiv.org/abs/2505.03730v1 Abstract: Action customization involves generating videos where the subject performs actions dictated by input control signals. Current methods use pose-guided or global motion customization but are limited by strict constraints on spatial structure, such as layout, sk...

May 08, 2025 • 20 min • Ep. 738

Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play

🤗 Upvotes: 56 | cs.AI, cs.CL, cs.SD Authors: Yemin Shi, Yu Shu, Siwei Dong, Guangyi Liu, Jaward Sesay, Jingwen Li, Zhiting Hu Title: Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play Arxiv: http://arxiv.org/abs/2505.02707v1 Abstract: A voice AI agent that blends seamlessly into daily life would interact with humans in an autonomous, real-time, and emotionally expressive manner. Rather than merely reacting to commands, it would continuously listen, ...

May 07, 2025 • 23 min • Ep. 737

RM-R1: Reward Modeling as Reasoning

🤗 Upvotes: 48 | cs.CL, cs.AI, cs.LG Authors: Xiusi Chen, Gaotang Li, Ziqi Wang, Bowen Jin, Cheng Qian, Yu Wang, Hongru Wang, Yu Zhang, Denghui Zhang, Tong Zhang, Hanghang Tong, Heng Ji Title: RM-R1: Reward Modeling as Reasoning Arxiv: http://arxiv.org/abs/2505.02387v1 Abstract: Reward modeling is essential for aligning large language models (LLMs) with human preferences, especially through reinforcement learning from human feedback (RLHF). To provide accurate reward signals, a reward model (RM)...

May 07, 2025 • 23 min • Ep. 736

Grokking in the Wild: Data Augmentation for Real-World Multi-Hop Reasoning with Transformers

🤗 Upvotes: 44 | cs.CL, cs.AI, cs.LG, I.2.7; I.2.6; I.2.3; I.7 Authors: Roman Abramov, Felix Steinbauer, Gjergji Kasneci Title: Grokking in the Wild: Data Augmentation for Real-World Multi-Hop Reasoning with Transformers Arxiv: http://arxiv.org/abs/2504.20752v1 Abstract: Transformers have achieved great success in numerous NLP tasks but continue to exhibit notable gaps in multi-step factual reasoning, especially when real-world knowledge is sparse. Recent advances in grokking have demonstrated t...

May 07, 2025 • 21 min • Ep. 735

Practical Efficiency of Muon for Pretraining

🤗 Upvotes: 30 | cs.LG, stat.ML Authors: Essential AI: Ishaan Shah, Anthony M. Polloreno, Karl Stratos, Philip Monk, Adarsh Chaluvaraju, Andrew Hojel, Andrew Ma, Anil Thomas, Ashish Tanwer, Darsh J Shah, Khoi Nguyen, Kurt Smith, Michael Callahan, Michael Pust, Mohit Parmar, Peter Rushton, Platon Mazarakis, Ritvik Kapila, Saurabh Srivastava, Somanshu Singla, Tim Romanski, Yash Vanjani, Ashish Vaswani Title: Practical Efficiency of Muon for Pretraining Arxiv: http://arxiv.org/abs/2505.02222v1 A...

May 07, 2025 • 23 min • Ep. 734

PixelHacker: Image Inpainting with Structural and Semantic Consistency

🤗 Upvotes: 24 | cs.CV Authors: Ziyang Xu, Kangsheng Duan, Xiaolei Shen, Zhifeng Ding, Wenyu Liu, Xiaohu Ruan, Xiaoxin Chen, Xinggang Wang Title: PixelHacker: Image Inpainting with Structural and Semantic Consistency Arxiv: http://arxiv.org/abs/2504.20438v2 Abstract: Image inpainting is a fundamental research area between image editing and image generation. Recent state-of-the-art (SOTA) methods have explored novel attention mechanisms, lightweight architectures, and context-aware modeling, demo...

May 06, 2025 • 18 min • Ep. 733

A Survey of Interactive Generative Video

🤗 Upvotes: 31 | cs.CV Authors: Jiwen Yu, Yiran Qin, Haoxuan Che, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Hao Chen, Xihui Liu Title: A Survey of Interactive Generative Video Arxiv: http://arxiv.org/abs/2504.21853v1 Abstract: Interactive Generative Video (IGV) has emerged as a crucial technology in response to the growing demand for high-quality, interactive video content across various domains. In this paper, we define IGV as a technology that combines generative capabilities to...

May 03, 2025 • 23 min • Ep. 732

DeepCritic: Deliberate Critique with Large Language Models

🤗 Upvotes: 27 | cs.CL, cs.AI, cs.LG Authors: Wenkai Yang, Jingwen Chen, Yankai Lin, Ji-Rong Wen Title: DeepCritic: Deliberate Critique with Large Language Models Arxiv: http://arxiv.org/abs/2505.00662v1 Abstract: As Large Language Models (LLMs) are rapidly evolving, providing accurate feedback and scalable oversight on their outputs becomes an urgent and critical problem. Leveraging LLMs as critique models to achieve automated supervision is a promising solution. In this work, we focus on study...

May 03, 2025 • 22 min • Ep. 731

Sadeed: Advancing Arabic Diacritization Through Small Language Model

🤗 Upvotes: 44 | cs.CL, cs.AI Authors: Zeina Aldallal, Sara Chrouf, Khalil Hennara, Mohamed Motaism Hamed, Muhammad Hreden, Safwan AlModhayan Title: Sadeed: Advancing Arabic Diacritization Through Small Language Model Arxiv: http://arxiv.org/abs/2504.21635v1 Abstract: Arabic text diacritization remains a persistent challenge in natural language processing due to the language's morphological richness. In this paper, we introduce Sadeed, a novel approach based on a fine-tuned decoder-only language...

May 02, 2025 • 22 min • Ep. 730

WebThinker: Empowering Large Reasoning Models with Deep Research Capability

🤗 Upvotes: 27 | cs.CL, cs.AI, cs.IR Authors: Xiaoxi Li, Jiajie Jin, Guanting Dong, Hongjin Qian, Yutao Zhu, Yongkang Wu, Ji-Rong Wen, Zhicheng Dou Title: WebThinker: Empowering Large Reasoning Models with Deep Research Capability Arxiv: http://arxiv.org/abs/2504.21776v1 Abstract: Large reasoning models (LRMs), such as OpenAI-o1 and DeepSeek-R1, demonstrate impressive long-horizon reasoning capabilities. However, their reliance on static internal knowledge limits their performance on complex, kn...

May 02, 2025 • 21 min • Ep. 729

Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math

🤗 Upvotes: 24 | cs.CL Authors: Haoran Xu, Baolin Peng, Hany Awadalla, Dongdong Chen, Yen-Chun Chen, Mei Gao, Young Jin Kim, Yunsheng Li, Liliang Ren, Yelong Shen, Shuohang Wang, Weijian Xu, Jianfeng Gao, Weizhu Chen Title: Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math Arxiv: http://arxiv.org/abs/2504.21233v1 Abstract: Chain-of-Thought (CoT) significantly enhances formal reasoning capabilities in Large Language Models (LLMs) by training them to explicitly ...

May 02, 2025 • 20 min • Ep. 728

COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning

🤗 Upvotes: 22 | cs.CV Authors: Xindi Wu, Hee Seung Hwang, Polina Kirichenko, Olga Russakovsky Title: COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning Arxiv: http://arxiv.org/abs/2504.21850v1 Abstract: Multimodal Large Language Models (MLLMs) excel at simple vision-language tasks but struggle when faced with complex tasks that require multiple capabilities, such as simultaneously recognizing objects, counting them, and understanding their spatial relationships. This might be par...

May 02, 2025 • 19 min • Ep. 727

Reinforcement Learning for Reasoning in Large Language Models with One Training Example

🤗 Upvotes: 49 | cs.LG, cs.AI, cs.CL Authors: Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Lucas Liu, Baolin Peng, Hao Cheng, Xuehai He, Kuan Wang, Jianfeng Gao, Weizhu Chen, Shuohang Wang, Simon Shaolei Du, Yelong Shen Title: Reinforcement Learning for Reasoning in Large Language Models with One Training Example Arxiv: http://arxiv.org/abs/2504.20571v1 Abstract: We show that reinforcement learning with verifiable reward using one training example (1-shot RLVR) is effective in incentivizin...

May 01, 2025 • 22 min • Ep. 726

UniversalRAG: Retrieval-Augmented Generation over Multiple Corpora with Diverse Modalities and Granularities

🤗 Upvotes: 44 | cs.CL, cs.AI, cs.CV, cs.IR, cs.LG Authors: Woongyeong Yeo, Kangsan Kim, Soyeong Jeong, Jinheon Baek, Sung Ju Hwang Title: UniversalRAG: Retrieval-Augmented Generation over Multiple Corpora with Diverse Modalities and Granularities Arxiv: http://arxiv.org/abs/2504.20734v1 Abstract: Retrieval-Augmented Generation (RAG) has shown substantial promise in improving factual accuracy by grounding model responses with external knowledge relevant to queries. However, most existing RAG app...

May 01, 2025 • 22 min • Ep. 725

ReasonIR: Training Retrievers for Reasoning Tasks

🤗 Upvotes: 36 | cs.AI, cs.CL, cs.IR, cs.LG Authors: Rulin Shao, Rui Qiao, Varsha Kishore, Niklas Muennighoff, Xi Victoria Lin, Daniela Rus, Bryan Kian Hsiang Low, Sewon Min, Wen-tau Yih, Pang Wei Koh, Luke Zettlemoyer Title: ReasonIR: Training Retrievers for Reasoning Tasks Arxiv: http://arxiv.org/abs/2504.20595v1 Abstract: We present ReasonIR-8B, the first retriever specifically trained for general reasoning tasks. Existing retrievers have shown limited gains on reasoning tasks, in part becaus...

May 01, 2025 • 21 min • Ep. 724

The Leaderboard Illusion

🤗 Upvotes: 36 | cs.AI, cs.CL, cs.LG, stat.ME Authors: Shivalika Singh, Yiyang Nan, Alex Wang, Daniel D'Souza, Sayash Kapoor, Ahmet Üstün, Sanmi Koyejo, Yuntian Deng, Shayne Longpre, Noah Smith, Beyza Ermis, Marzieh Fadaee, Sara Hooker Title: The Leaderboard Illusion Arxiv: http://arxiv.org/abs/2504.20879v1 Abstract: Measuring progress is fundamental to the advancement of any scientific field. As benchmarks play an increasingly central role, they also grow more susceptible to distortion. Chatbot...

May 01, 2025 • 21 min • Ep. 723

Toward Evaluative Thinking: Meta Policy Optimization with Evolving Reward Models

🤗 Upvotes: 28 | cs.CL Authors: Zae Myung Kim, Chanwoo Park, Vipul Raheja, Dongyeop Kang Title: Toward Evaluative Thinking: Meta Policy Optimization with Evolving Reward Models Arxiv: http://arxiv.org/abs/2504.20157v1 Abstract: Reward-based alignment methods for large language models (LLMs) face two key limitations: vulnerability to reward hacking, where models exploit flaws in the reward signal; and reliance on brittle, labor-intensive prompt engineering when LLMs are used as reward models. We ...

May 01, 2025 • 21 min • Ep. 722

RepText: Rendering Visual Text via Replicating

🤗 Upvotes: 22 | cs.CV Authors: Haofan Wang, Yujia Xu, Yimeng Li, Junchen Li, Chaowei Zhang, Jing Wang, Kejia Yang, Zhibo Chen Title: RepText: Rendering Visual Text via Replicating Arxiv: http://arxiv.org/abs/2504.19724v1 Abstract: Although contemporary text-to-image generation models have achieved remarkable breakthroughs in producing visually appealing images, their capacity to generate precise and flexible typographic elements, especially non-Latin alphabets, remains constrained. To address t...

Apr 30, 2025 • 22 min • Ep. 721

Towards Understanding Camera Motions in Any Video

🤗 Upvotes: 127 | cs.CV, cs.AI, cs.CL, cs.LG, cs.MM Authors: Zhiqiu Lin, Siyuan Cen, Daniel Jiang, Jay Karhade, Hewei Wang, Chancharik Mitra, Tiffany Ling, Yuhan Huang, Sifan Liu, Mingyu Chen, Rushikesh Zawar, Xue Bai, Yilun Du, Chuang Gan, Deva Ramanan Title: Towards Understanding Camera Motions in Any Video Arxiv: http://arxiv.org/abs/2504.15376v1 Abstract: We introduce CameraBench, a large-scale dataset and benchmark designed to assess and improve camera motion understanding. CameraBench cons...

Apr 29, 2025 • 22 min • Ep. 720

Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning

🤗 Upvotes: 43 | cs.CV Authors: Chris, Yichen Wei, Yi Peng, Xiaokun Wang, Weijie Qiu, Wei Shen, Tianyidan Xie, Jiangbo Pei, Jianhao Zhang, Yunzhuo Hao, Xuchen Song, Yang Liu, Yahui Zhou Title: Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning Arxiv: http://arxiv.org/abs/2504.16656v2 Abstract: We present Skywork R1V2, a next-generation multimodal reasoning model and a major leap forward from its predecessor, Skywork R1V. At its core, R1V2 introduces a hybrid reinforcement learn...

Apr 29, 2025 • 21 min • Ep. 719

BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs

🤗 Upvotes: 25 | cs.CL, cs.LG Authors: Hongyu Wang, Shuming Ma, Furu Wei Title: BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs Arxiv: http://arxiv.org/abs/2504.18415v1 Abstract: Efficient deployment of 1-bit Large Language Models (LLMs) is hindered by activation outliers, which complicate quantization to low bit-widths. We introduce BitNet v2, a novel framework enabling native 4-bit activation quantization for 1-bit LLMs. To tackle outliers in attention and feed-...

Apr 29, 2025 • 21 min • Ep. 718

Step1X-Edit: A Practical Framework for General Image Editing

🤗 Upvotes: 55 | cs.CV Authors: Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, Guopeng Li, Yuang Peng, Quan Sun, Jingwei Wu, Yan Cai, Zheng Ge, Ranchen Ming, Lei Xia, Xianfang Zeng, Yibo Zhu, Binxing Jiao, Xiangyu Zhang, Gang Yu, Daxin Jiang Title: Step1X-Edit: A Practical Framework for General Image Editing Arxiv: http://arxiv.org/abs/2504.17761v1 Abstract: In recent years, image editing models have witnessed remarkable and...

Apr 26, 2025 • 21 min • Ep. 717

Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning

🤗 Upvotes: 50 | cs.CL Authors: Minju Seo, Jinheon Baek, Seongyun Lee, Sung Ju Hwang Title: Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning Arxiv: http://arxiv.org/abs/2504.17192v1 Abstract: Despite the rapid growth of machine learning research, corresponding code implementations are often unavailable, making it slow and labor-intensive for researchers to reproduce results and build upon prior work. In the meantime, recent Large Language Models (LLMs) excel at u...

Apr 26, 2025 • 22 min • Ep. 716

RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation

🤗 Upvotes: 47 | cs.CV Authors: Aviv Slobodkin, Hagai Taitelbaum, Yonatan Bitton, Brian Gordon, Michal Sokolik, Nitzan Bitton Guetta, Almog Gueta, Royi Rassin, Itay Laish, Dani Lischinski, Idan Szpektor Title: RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation Arxiv: http://arxiv.org/abs/2504.17502v1 Abstract: Subject-driven text-to-image (T2I) generation aims to produce images that align with a given textual description, while preserving the visual identity from a r...

Apr 26, 2025 • 20 min • Ep. 715

Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs

🤗 Upvotes: 28 | cs.CV Authors: Tiancheng Gu, Kaicheng Yang, Ziyong Feng, Xingjun Wang, Yanzhao Zhang, Dingkun Long, Yingda Chen, Weidong Cai, Jiankang Deng Title: Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs Arxiv: http://arxiv.org/abs/2504.17432v1 Abstract: The Contrastive Language-Image Pre-training (CLIP) framework has become a widely used approach for multimodal representation learning, particularly in image-text retrieval and clustering. However, its eff...

Apr 26, 2025 • 25 min • Ep. 714

DreamID: High-Fidelity and Fast diffusion-based Face Swapping via Triplet ID Group Learning

🤗 Upvotes: 39 | cs.CV, cs.AI Authors: Fulong Ye, Miao Hua, Pengze Zhang, Xinghui Li, Qichao Sun, Songtao Zhao, Qian He, Xinglong Wu Title: DreamID: High-Fidelity and Fast diffusion-based Face Swapping via Triplet ID Group Learning Arxiv: http://arxiv.org/abs/2504.14509v2 Abstract: In this paper, we introduce DreamID, a diffusion-based face swapping model that achieves high levels of ID similarity, attribute preservation, image fidelity, and fast inference speed. Unlike the typical face swapping...

Apr 25, 2025 • 21 min • Ep. 713

Trillion 7B Technical Report

🤗 Upvotes: 27 | cs.CL, cs.AI, cs.LG Authors: Sungjun Han, Juyoung Suk, Suyeong An, Hyungguk Kim, Kyuseok Kim, Wonsuk Yang, Seungtaek Choi, Jamin Shin Title: Trillion 7B Technical Report Arxiv: http://arxiv.org/abs/2504.15431v1 Abstract: We introduce Trillion-7B, the most token-efficient Korean-centric multilingual LLM available. Our novel Cross-lingual Document Attention (XLDA) mechanism enables highly efficient and effective knowledge transfer from English to target languages like Korean and J...

Apr 25, 2025 • 26 min • Ep. 712