Daily Paper Cast

Jingwen Liang, Gengyu Wang•dailypapercast.transistor.fm

We publish 10 episodes every day, each discussing one AI research paper. Both the podcast scripts and the audio are generated by AI. The 10 papers are selected from the highest-voted ones on Hugging Face Daily Papers (https://huggingface.co/papers). Feedback and suggestions are welcome! Email us: dailypapercast.ai@gmail.com

Creators:
Jingwen Liang, 3D ML, https://www.linkedin.com/in/jingwen-liang/
Gengyu Wang, NLP, http://wanggengyu.com

Listen on:
Spotify: https://open.spotify.com/show/21nrhmdaA8qoBiH8q03NXL
Apple Podcast: https://podcasts.apple.com/us/podcast/daily-paper-cast/id1777620236

Cover Image by Kawen Kuang https://kawen.art

Last refreshed: July 27th, 2025 at 9:35 AM


Episodes

Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO

🤗 Upvotes: 37 | cs.CL, cs.AI, cs.CV, cs.LG Authors: Lai Wei, Yuting Li, Chen Wang, Yue Wang, Linghe Kong, Weiran Huang, Lichao Sun Title: Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO Arxiv: http://arxiv.org/abs/2505.22453v1 Abstract: Improving Multi-modal Large Language Models (MLLMs) in the post-training stage typically relies on supervised fine-tuning (SFT) or reinforcement learning (RL). However, these supervised methods require expensive and manually annotated multi-mod...

May 30, 2025•23 min•Ep. 831

SageAttention2++: A More Efficient Implementation of SageAttention2

🤗 Upvotes: 33 | cs.LG, cs.AI, cs.AR, cs.CV Authors: Jintao Zhang, Xiaoming Xu, Jia Wei, Haofeng Huang, Pengle Zhang, Chendong Xiang, Jun Zhu, Jianfei Chen Title: SageAttention2++: A More Efficient Implementation of SageAttention2 Arxiv: http://arxiv.org/abs/2505.21136v2 Abstract: The efficiency of attention is critical because its time complexity grows quadratically with sequence length. SageAttention2 addresses this by utilizing quantization to accelerate matrix multiplications (Matmul) in att...

May 30, 2025•20 min•Ep. 830

Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start

🤗 Upvotes: 31 | cs.CL, cs.AI, cs.CV, cs.LG Authors: Lai Wei, Yuting Li, Kaipeng Zheng, Chen Wang, Yue Wang, Linghe Kong, Lichao Sun, Weiran Huang Title: Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start Arxiv: http://arxiv.org/abs/2505.22334v1 Abstract: Recent advancements in large language models (LLMs) have demonstrated impressive chain-of-thought reasoning capabilities, with reinforcement learning (RL) playing a crucial role in this progress. While "aha moment" patter...

May 30, 2025•22 min•Ep. 829

Fostering Video Reasoning via Next-Event Prediction

🤗 Upvotes: 27 | cs.CV, cs.AI, cs.CL Authors: Haonan Wang, Hongfu Liu, Xiangyan Liu, Chao Du, Kenji Kawaguchi, Ye Wang, Tianyu Pang Title: Fostering Video Reasoning via Next-Event Prediction Arxiv: http://arxiv.org/abs/2505.22457v1 Abstract: Next-token prediction serves as the foundational learning task enabling reasoning in LLMs. But what should the learning task be when aiming to equip MLLMs with temporal reasoning capabilities over video inputs? Existing tasks such as video question answering...

May 30, 2025•25 min•Ep. 828

RenderFormer: Transformer-based Neural Rendering of Triangle Meshes with Global Illumination

🤗 Upvotes: 26 | cs.GR, cs.CV, cs.LG Authors: Chong Zeng, Yue Dong, Pieter Peers, Hongzhi Wu, Xin Tong Title: RenderFormer: Transformer-based Neural Rendering of Triangle Meshes with Global Illumination Arxiv: http://arxiv.org/abs/2505.21925v1 Abstract: We present RenderFormer, a neural rendering pipeline that directly renders an image from a triangle-based representation of a scene with full global illumination effects and that does not require per-scene training or fine-tuning. Instead of taki...

May 30, 2025•23 min•Ep. 827

ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows

🤗 Upvotes: 85 | cs.AI, cs.CL, cs.CV, cs.HC Authors: Qiushi Sun, Zhoumianze Liu, Chang Ma, Zichen Ding, Fangzhi Xu, Zhangyue Yin, Haiteng Zhao, Zhenyu Wu, Kanzhi Cheng, Zhaoyang Liu, Jianing Wang, Qintong Li, Xiangru Tang, Tianbao Xie, Xiachong Feng, Xiang Li, Ben Kao, Wenhai Wang, Biqing Qi, Lingpeng Kong, Zhiyong Wu Title: ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows Arxiv: http://arxiv.org/abs/2505.19897v1 Abstract: Large Language Models (LLMs) have ...

May 29, 2025•22 min•Ep. 826

MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs

🤗 Upvotes: 73 | cs.AI, cs.CV Authors: Jiakang Yuan, Tianshuo Peng, Yilei Jiang, Yiting Lu, Renrui Zhang, Kaituo Feng, Chaoyou Fu, Tao Chen, Lei Bai, Bo Zhang, Xiangyu Yue Title: MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs Arxiv: http://arxiv.org/abs/2505.21327v1 Abstract: Logical reasoning is a fundamental aspect of human intelligence and an essential capability for multimodal large language models (MLLMs). Despite the significant advancement in multimodal reasoning,...

May 29, 2025•21 min•Ep. 825

Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers

🤗 Upvotes: 73 | cs.CV, cs.AI, cs.CL, cs.MA Authors: Wei Pang, Kevin Qinghong Lin, Xiangru Jian, Xi He, Philip Torr Title: Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers Arxiv: http://arxiv.org/abs/2505.21497v1 Abstract: Academic poster generation is a crucial yet challenging task in scientific communication, requiring the compression of long-context interleaved documents into a single, visually coherent page. To address this challenge, we introduce the first benchmark...

May 29, 2025•18 min•Ep. 824

OmniConsistency: Learning Style-Agnostic Consistency from Paired Stylization Data

🤗 Upvotes: 57 | cs.CV Authors: Yiren Song, Cheng Liu, Mike Zheng Shou Title: OmniConsistency: Learning Style-Agnostic Consistency from Paired Stylization Data Arxiv: http://arxiv.org/abs/2505.18445v1 Abstract: Diffusion models have advanced image stylization significantly, yet two core challenges persist: (1) maintaining consistent stylization in complex scenes, particularly identity, composition, and fine details, and (2) preventing style degradation in image-to-image pipelines with style LoRA...

May 29, 2025•24 min•Ep. 823

OpenS2V-Nexus: A Detailed Benchmark and Million-Scale Dataset for Subject-to-Video Generation

🤗 Upvotes: 49 | cs.CV, cs.AI Authors: Shenghai Yuan, Xianyi He, Yufan Deng, Yang Ye, Jinfa Huang, Bin Lin, Jiebo Luo, Li Yuan Title: OpenS2V-Nexus: A Detailed Benchmark and Million-Scale Dataset for Subject-to-Video Generation Arxiv: http://arxiv.org/abs/2505.20292v3 Abstract: Subject-to-Video (S2V) generation aims to create videos that faithfully incorporate reference content, providing enhanced flexibility in the production of videos. To establish the infrastructure for S2V generation, we pro...

May 29, 2025•20 min•Ep. 822

SynLogic: Synthesizing Verifiable Reasoning Data at Scale for Learning Logical Reasoning and Beyond

🤗 Upvotes: 43 | cs.AI, cs.CL Authors: Junteng Liu, Yuanxiang Fan, Zhuo Jiang, Han Ding, Yongyi Hu, Chi Zhang, Yiqi Shi, Shitong Weng, Aili Chen, Shiqi Chen, Yunan Huang, Mozhi Zhang, Pengyu Zhao, Junjie Yan, Junxian He Title: SynLogic: Synthesizing Verifiable Reasoning Data at Scale for Learning Logical Reasoning and Beyond Arxiv: http://arxiv.org/abs/2505.19641v3 Abstract: Recent advances such as OpenAI-o1 and DeepSeek R1 have demonstrated the potential of Reinforcement Learning (RL) to enhanc...

May 29, 2025•22 min•Ep. 821

Don't Overthink it. Preferring Shorter Thinking Chains for Improved LLM Reasoning

🤗 Upvotes: 41 | cs.CL, cs.AI Authors: Michael Hassid, Gabriel Synnaeve, Yossi Adi, Roy Schwartz Title: Don't Overthink it. Preferring Shorter Thinking Chains for Improved LLM Reasoning Arxiv: http://arxiv.org/abs/2505.17813v1 Abstract: Reasoning large language models (LLMs) heavily rely on scaling test-time compute to perform complex reasoning tasks by generating extensive "thinking" chains. While demonstrating impressive results, this approach incurs significant computational costs and inferen...

May 29, 2025•19 min•Ep. 820

Exploring the Latent Capacity of LLMs for One-Step Text Generation

🤗 Upvotes: 40 | cs.CL, cs.AI, cs.LG Authors: Gleb Mezentsev, Ivan Oseledets Title: Exploring the Latent Capacity of LLMs for One-Step Text Generation Arxiv: http://arxiv.org/abs/2505.21189v1 Abstract: A recent study showed that large language models (LLMs) can reconstruct surprisingly long texts - up to thousands of tokens - via autoregressive generation from just one specially trained input embedding. In this work, we explore whether such reconstruction is possible without autoregression. We s...

May 29, 2025•21 min•Ep. 819

Guided by Gut: Efficient Test-Time Scaling with Reinforced Intrinsic Confidence

🤗 Upvotes: 39 | cs.CL, cs.AI Authors: Amirhosein Ghasemabadi, Keith G. Mills, Baochun Li, Di Niu Title: Guided by Gut: Efficient Test-Time Scaling with Reinforced Intrinsic Confidence Arxiv: http://arxiv.org/abs/2505.20325v1 Abstract: Test-Time Scaling (TTS) methods for enhancing Large Language Model (LLM) reasoning often incur substantial computational costs, primarily due to extensive reliance on external Process Reward Models (PRMs) or sampling methods like Best-of-N (BoN). This paper introd...

May 29, 2025•22 min•Ep. 818

VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization

🤗 Upvotes: 35 | cs.CL, cs.CV Authors: Yunxin Li, Xinyu Chen, Zitao Li, Zhenyu Liu, Longyue Wang, Wenhan Luo, Baotian Hu, Min Zhang Title: VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization Arxiv: http://arxiv.org/abs/2505.19000v1 Abstract: Applying Reinforcement Learning (RL) to Video Large Language Models (Video-LLMs) shows significant promise for complex video reasoning. However, popular Reinforcement Fine-Tuning (RFT) methods, such as outcome-b...

May 29, 2025•21 min•Ep. 817

Mutarjim: Advancing Bidirectional Arabic-English Translation with a Small Language Model

🤗 Upvotes: 178 | cs.CL, cs.AI Authors: Khalil Hennara, Muhammad Hreden, Mohamed Motaism Hamed, Zeina Aldallal, Sara Chrouf, Safwan AlModhayan Title: Mutarjim: Advancing Bidirectional Arabic-English Translation with a Small Language Model Arxiv: http://arxiv.org/abs/2505.17894v1 Abstract: We introduce Mutarjim, a compact yet powerful language model for bidirectional Arabic-English translation. While large-scale LLMs have shown impressive progress in natural language processing tasks, including m...

May 28, 2025•21 min•Ep. 816

Shifting AI Efficiency From Model-Centric to Data-Centric Compression

🤗 Upvotes: 124 | cs.CL, cs.AI, cs.CV Authors: Xuyang Liu, Zichen Wen, Shaobo Wang, Junjie Chen, Zhishan Tao, Yubo Wang, Xiangqi Jin, Chang Zou, Yiyu Wang, Chenfei Liao, Xu Zheng, Honggang Chen, Weijia Li, Xuming Hu, Conghui He, Linfeng Zhang Title: Shifting AI Efficiency From Model-Centric to Data-Centric Compression Arxiv: http://arxiv.org/abs/2505.19147v1 Abstract: The rapid advancement of large language models (LLMs) and multi-modal LLMs (MLLMs) has historically relied on model-centric scali...

May 28, 2025•22 min•Ep. 815

Alchemist: Turning Public Text-to-Image Data into Generative Gold

🤗 Upvotes: 58 | cs.CV Authors: Valerii Startsev, Alexander Ustyuzhanin, Alexey Kirillov, Dmitry Baranchuk, Sergey Kastryulin Title: Alchemist: Turning Public Text-to-Image Data into Generative Gold Arxiv: http://arxiv.org/abs/2505.19297v1 Abstract: Pre-training equips text-to-image (T2I) models with broad world knowledge, but this alone is often insufficient to achieve high aesthetic quality and alignment. Consequently, supervised fine-tuning (SFT) is crucial for further refinement. However, it...

May 28, 2025•19 min•Ep. 814

BizFinBench: A Business-Driven Real-World Financial Benchmark for Evaluating LLMs

🤗 Upvotes: 56 | cs.AI, cs.CE, cs.CL Authors: Guilong Lu, Xuntao Guo, Rongjunchen Zhang, Wenqiao Zhu, Ji Liu Title: BizFinBench: A Business-Driven Real-World Financial Benchmark for Evaluating LLMs Arxiv: http://arxiv.org/abs/2505.19457v1 Abstract: Large language models excel in general tasks, yet assessing their reliability in logic-heavy, precision-critical domains like finance, law, and healthcare remains challenging. To address this, we introduce BizFinBench, the first benchmark specifically...

May 28, 2025•24 min•Ep. 813

PATS: Process-Level Adaptive Thinking Mode Switching

🤗 Upvotes: 44 | cs.CL Authors: Yi Wang, Junxiao Liu, Shimao Zhang, Jiajun Chen, Shujian Huang Title: PATS: Process-Level Adaptive Thinking Mode Switching Arxiv: http://arxiv.org/abs/2505.19250v1 Abstract: Current large-language models (LLMs) typically adopt a fixed reasoning strategy, either simple or complex, for all questions, regardless of their difficulty. This neglect of variation in task and reasoning process complexity leads to an imbalance between performance and efficiency. Existing me...

May 28, 2025•21 min•Ep. 812

Embodied Agents Meet Personalization: Exploring Memory Utilization for Personalized Assistance

🤗 Upvotes: 42 | cs.CL Authors: Taeyoon Kwon, Dongwook Choi, Sunghwan Kim, Hyojun Kim, Seungjun Moon, Beong-woo Kwak, Kuan-Hao Huang, Jinyoung Yeo Title: Embodied Agents Meet Personalization: Exploring Memory Utilization for Personalized Assistance Arxiv: http://arxiv.org/abs/2505.16348v1 Abstract: Embodied agents empowered by large language models (LLMs) have shown strong performance in household object rearrangement tasks. However, these tasks primarily focus on single-turn interactions with s...

May 28, 2025•21 min•Ep. 811

ARM: Adaptive Reasoning Model

🤗 Upvotes: 40 | cs.CL Authors: Siye Wu, Jian Xie, Yikai Zhang, Aili Chen, Kai Zhang, Yu Su, Yanghua Xiao Title: ARM: Adaptive Reasoning Model Arxiv: http://arxiv.org/abs/2505.20258v1 Abstract: While large reasoning models demonstrate strong performance on complex tasks, they lack the ability to adjust reasoning token usage based on task difficulty. This often leads to the "overthinking" problem -- excessive and unnecessary reasoning -- which, although potentially mitigated by human intervention...

May 28, 2025•23 min•Ep. 810

Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles

🤗 Upvotes: 33 | cs.CL, cs.AI Authors: Jiangjie Chen, Qianyu He, Siyu Yuan, Aili Chen, Zhicheng Cai, Weinan Dai, Hongli Yu, Qiying Yu, Xuefeng Li, Jiaze Chen, Hao Zhou, Mingxuan Wang Title: Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles Arxiv: http://arxiv.org/abs/2505.19914v1 Abstract: Large Language Models (LLMs), such as OpenAI's o1 and DeepSeek's R1, excel at advanced reasoning tasks like math and coding via Reinforcement Learning with Verifiab...

May 28, 2025•21 min•Ep. 809

Deciphering Trajectory-Aided LLM Reasoning: An Optimization Perspective

🤗 Upvotes: 33 | cs.CL, cs.AI Authors: Junnan Liu, Hongwei Liu, Linchen Xiao, Shudong Liu, Taolin Zhang, Zihan Ma, Songyang Zhang, Kai Chen Title: Deciphering Trajectory-Aided LLM Reasoning: An Optimization Perspective Arxiv: http://arxiv.org/abs/2505.19815v1 Abstract: We propose a novel framework for comprehending the reasoning capabilities of large language models (LLMs) through the perspective of meta-learning. By conceptualizing reasoning trajectories as pseudo-gradient descent updates to th...

May 28, 2025•21 min•Ep. 808

B-score: Detecting biases in large language models using response history

🤗 Upvotes: 25 | cs.LG, cs.CL Authors: An Vo, Mohammad Reza Taesiri, Daeyoung Kim, Anh Totti Nguyen Title: B-score: Detecting biases in large language models using response history Arxiv: http://arxiv.org/abs/2505.18545v1 Abstract: Large language models (LLMs) often exhibit strong biases, e.g., against women or in favor of the number 7. We investigate whether LLMs would be able to output less biased answers when allowed to observe their prior answers to the same question in a multi-turn conversat...

May 28, 2025•23 min•Ep. 807

TabSTAR: A Foundation Tabular Model With Semantically Target-Aware Representations

🤗 Upvotes: 95 | cs.LG, cs.CL Authors: Alan Arazi, Eilam Shapira, Roi Reichart Title: TabSTAR: A Foundation Tabular Model With Semantically Target-Aware Representations Arxiv: http://arxiv.org/abs/2505.18125v1 Abstract: While deep learning has achieved remarkable success across many domains, it has historically underperformed on tabular learning tasks, which remain dominated by gradient boosting decision trees (GBDTs). However, recent advancements are paving the way for Tabular Foundation Models...

May 27, 2025•21 min•Ep. 806

QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning

🤗 Upvotes: 60 | cs.CL Authors: Fanqi Wan, Weizhou Shen, Shengyi Liao, Yingcheng Shi, Chenliang Li, Ziyi Yang, Ji Zhang, Fei Huang, Jingren Zhou, Ming Yan Title: QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning Arxiv: http://arxiv.org/abs/2505.17667v1 Abstract: Recent large reasoning models (LRMs) have demonstrated strong reasoning capabilities through reinforcement learning (RL). These improvements have primarily been observed within the short-context reasoni...

May 27, 2025•24 min•Ep. 805

Quartet: Native FP4 Training Can Be Optimal for Large Language Models

🤗 Upvotes: 55 | cs.LG Authors: Roberto L. Castro, Andrei Panferov, Soroush Tabesh, Oliver Sieberling, Jiale Chen, Mahdi Nikdan, Saleh Ashkboos, Dan Alistarh Title: Quartet: Native FP4 Training Can Be Optimal for Large Language Models Arxiv: http://arxiv.org/abs/2505.14669v1 Abstract: The rapid advancement of large language models (LLMs) has been paralleled by unprecedented increases in computational demands, with training costs for state-of-the-art models doubling every few months. Training mod...

May 27, 2025•23 min•Ep. 804

Reasoning Model is Stubborn: Diagnosing Instruction Overriding in Reasoning Models

🤗 Upvotes: 51 | cs.AI Authors: Doohyuk Jang, Yoonjeon Kim, Chanjae Park, Hyun Ryu, Eunho Yang Title: Reasoning Model is Stubborn: Diagnosing Instruction Overriding in Reasoning Models Arxiv: http://arxiv.org/abs/2505.17225v1 Abstract: Large language models have demonstrated remarkable proficiency in long and complex reasoning tasks. However, they frequently exhibit a problematic reliance on familiar reasoning patterns, a phenomenon we term reasoning rigidity. Despite explicit instructi...

May 27, 2025•21 min•Ep. 803

One RL to See Them All: Visual Triple Unified Reinforcement Learning

🤗 Upvotes: 51 | cs.CV, cs.CL Authors: Yan Ma, Linge Du, Xuyang Shen, Shaoxiang Chen, Pengfei Li, Qibing Ren, Lizhuang Ma, Yuchao Dai, Pengfei Liu, Junjie Yan Title: One RL to See Them All: Visual Triple Unified Reinforcement Learning Arxiv: http://arxiv.org/abs/2505.18129v1 Abstract: Reinforcement learning (RL) has significantly advanced the reasoning capabilities of vision-language models (VLMs). However, the use of RL beyond reasoning tasks remains largely unexplored, especially for perceptio...

May 27, 2025•20 min•Ep. 802


Hosted on Transistor
