
Daily Paper Cast

Jingwen Liang, Gengyu Wang
dailypapercast.transistor.fm
We publish 10 episodes every day to discuss 10 AI research papers. Both the podcast scripts and audio are generated by AI. The 10 papers are selected from the highest-voted ones on Huggingface Daily Paper (https://huggingface.co/papers).

Feedback and suggestions are welcome! Email us: dailypapercast.ai@gmail.com

Creator:
Jingwen Liang, 3D ML, https://www.linkedin.com/in/jingwen-liang/
Gengyu Wang, NLP, http://wanggengyu.com

Listen on:
Spotify: https://open.spotify.com/show/21nrhmdaA8qoBiH8q03NXL
Apple Podcast: https://podcasts.apple.com/us/podcast/daily-paper-cast/id1777620236

Cover Image by Kawen Kuang https://kawen.art

Episodes

Will It Still Be True Tomorrow? Multilingual Evergreen Question Classification to Improve Trustworthy QA

🤗 Upvotes: 83 | cs.CL
Authors: Sergey Pletenev, Maria Marina, Nikolay Ivanov, Daria Galimzianova, Nikita Krayko, Mikhail Salnikov, Vasily Konovalov, Alexander Panchenko, Viktor Moskvoretskii
Title: Will It Still Be True Tomorrow? Multilingual Evergreen Question Classification to Improve Trustworthy QA
Arxiv: http://arxiv.org/abs/2505.21115v1
Abstract: Large Language Models (LLMs) often hallucinate in question answering (QA) tasks. A key yet underexplored factor contributing to this is the tempo...

Jun 10, 2025 | 22 min | Ep. 891

FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal Contextual Fusion

🤗 Upvotes: 27 | cs.SD, cs.AI, eess.AS
Authors: Shunian Chen, Xinyuan Xie, Zheshu Chen, Liyan Zhao, Owen Lee, Zhan Su, Qilin Sun, Benyou Wang
Title: FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal Contextual Fusion
Arxiv: http://arxiv.org/abs/2506.01111v1
Abstract: High-quality, large-scale audio captioning is crucial for advancing audio understanding, yet current automated methods often generate captions that lack fine-grained detail and contextual accuracy, primarily du...

Jun 10, 2025 | 21 min | Ep. 890

MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning

🤗 Upvotes: 26 | cs.CV, cs.AI, cs.CL, cs.LG
Authors: Zikui Cai, Andrew Wang, Anirudh Satheesh, Ankit Nakhawa, Hyunwoo Jae, Keenan Powell, Minghui Liu, Neel Jay, Sungbin Oh, Xiyao Wang, Yongyuan Liang, Tom Goldstein, Furong Huang
Title: MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning
Arxiv: http://arxiv.org/abs/2506.05523v1
Abstract: Despite rapid advances in vision-language models (VLMs), current benchmarks for multimodal reasoning fall short in thr...

Jun 10, 2025 | 23 min | Ep. 889

Leveraging Self-Attention for Input-Dependent Soft Prompting in LLMs

🤗 Upvotes: 25 | cs.CL
Authors: Ananth Muppidi, Abhilash Nandy, Sambaran Bandyopadhyay
Title: Leveraging Self-Attention for Input-Dependent Soft Prompting in LLMs
Arxiv: http://arxiv.org/abs/2506.05629v1
Abstract: The performance of large language models in domain-specific tasks necessitates fine-tuning, which is computationally expensive and technically challenging. This paper focuses on parameter-efficient fine-tuning using soft prompting, a promising approach that adapts pre-trained models to...

Jun 10, 2025 | 20 min | Ep. 888

SeedVR2: One-Step Video Restoration via Diffusion Adversarial Post-Training

🤗 Upvotes: 39 | cs.CV
Authors: Jianyi Wang, Shanchuan Lin, Zhijie Lin, Yuxi Ren, Meng Wei, Zongsheng Yue, Shangchen Zhou, Hao Chen, Yang Zhao, Ceyuan Yang, Xuefeng Xiao, Chen Change Loy, Lu Jiang
Title: SeedVR2: One-Step Video Restoration via Diffusion Adversarial Post-Training
Arxiv: http://arxiv.org/abs/2506.05301v1
Abstract: Recent advances in diffusion-based video restoration (VR) demonstrate significant improvement in visual quality, yet yield a prohibitive computational cost during infere...

Jun 07, 2025 | 22 min | Ep. 887

ComfyUI-Copilot: An Intelligent Assistant for Automated Workflow Development

🤗 Upvotes: 38 | cs.CL, cs.CV
Authors: Zhenran Xu, Xue Yang, Yiyu Wang, Qingli Hu, Zijiao Wu, Longyue Wang, Weihua Luo, Kaifu Zhang, Baotian Hu, Min Zhang
Title: ComfyUI-Copilot: An Intelligent Assistant for Automated Workflow Development
Arxiv: http://arxiv.org/abs/2506.05010v1
Abstract: We introduce ComfyUI-Copilot, a large language model-powered plugin designed to enhance the usability and efficiency of ComfyUI, an open-source platform for AI-driven art creation. Despite its flexibility and u...

Jun 07, 2025 | 21 min | Ep. 886

Diagonal Batching Unlocks Parallelism in Recurrent Memory Transformers for Long Contexts

🤗 Upvotes: 32 | cs.LG, cs.CL
Authors: Danil Sivtsov, Ivan Rodkin, Gleb Kuzmin, Yuri Kuratov, Ivan Oseledets
Title: Diagonal Batching Unlocks Parallelism in Recurrent Memory Transformers for Long Contexts
Arxiv: http://arxiv.org/abs/2506.05229v1
Abstract: Transformer models struggle with long-context inference due to their quadratic time and linear memory complexity. Recurrent Memory Transformers (RMTs) offer a solution by reducing the asymptotic cost to linear time and constant memory usage. Ho...

Jun 07, 2025 | 20 min | Ep. 885

RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics

🤗 Upvotes: 32 | cs.RO, cs.AI, cs.CV
Authors: Enshen Zhou, Jingkun An, Cheng Chi, Yi Han, Shanyu Rong, Chi Zhang, Pengwei Wang, Zhongyuan Wang, Tiejun Huang, Lu Sheng, Shanghang Zhang
Title: RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics
Arxiv: http://arxiv.org/abs/2506.04308v1
Abstract: Spatial referring is a fundamental capability of embodied robots to interact with the 3D physical world. However, even with the powerful pretrained vision language mod...

Jun 07, 2025 | 24 min | Ep. 884

Video World Models with Long-term Spatial Memory

🤗 Upvotes: 30 | cs.CV
Authors: Tong Wu, Shuai Yang, Ryan Po, Yinghao Xu, Ziwei Liu, Dahua Lin, Gordon Wetzstein
Title: Video World Models with Long-term Spatial Memory
Arxiv: http://arxiv.org/abs/2506.05284v1
Abstract: Emerging world models autoregressively generate video frames in response to actions, such as camera movements and text prompts, among other control signals. Due to limited temporal context window sizes, these models often struggle to maintain scene consistency during revisits, le...

Jun 07, 2025 | 22 min | Ep. 883

Surfer-H Meets Holo1: Cost-Efficient Web Agent Powered by Open Weights

🤗 Upvotes: 27 | cs.AI
Authors: Mathieu Andreux, Breno Baldas Skuk, Hamza Benchekroun, Emilien Biré, Antoine Bonnet, Riaz Bordie, Matthias Brunel, Pierre-Louis Cedoz, Antoine Chassang, Mickaël Chen, Alexandra D. Constantinou, Antoine d'Andigné, Hubert de La Jonquière, Aurélien Delfosse, Ludovic Denoyer, Alexis Deprez, Augustin Derupti, Michael Eickenberg, Mathïs Federico, Charles Kantor, Xavier Koegler, Yann Labbé, Matthew C. H. Lee, Erwan Le Jumeau de Kergaradec, Amir Mahla, Avshalom Manevich, ...

Jun 07, 2025 | 25 min | Ep. 882

Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

🤗 Upvotes: 24 | cs.CL
Authors: Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, Jingren Zhou
Title: Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models
Arxiv: http://arxiv.org/abs/2506.05176v1
Abstract: In this work, we introduce the Qwen3 Embedding series, a significant advancement over its predecessor, the GTE-Qwen series, in text embedding and reranking capabilities, built upo...

Jun 07, 2025 | 21 min | Ep. 881

VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models

🤗 Upvotes: 23 | cs.CV
Authors: Xiangdong Zhang, Jiaqi Liao, Shaofeng Zhang, Fanqing Meng, Xiangpeng Wan, Junchi Yan, Yu Cheng
Title: VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models
Arxiv: http://arxiv.org/abs/2505.23656v1
Abstract: Recent advancements in text-to-video (T2V) diffusion models have enabled high-fidelity and realistic video synthesis. However, current T2V models often struggle to generate physically plausible content due to their...

Jun 07, 2025 | 23 min | Ep. 880

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

🤗 Upvotes: 22 | cs.CL, cs.LG
Authors: Nikhil Kandpal, Brian Lester, Colin Raffel, Sebastian Majstorovic, Stella Biderman, Baber Abbasi, Luca Soldaini, Enrico Shippole, A. Feder Cooper, Aviya Skowron, John Kirchenbauer, Shayne Longpre, Lintang Sutawika, Alon Albalak, Zhenlin Xu, Guilherme Penedo, Loubna Ben Allal, Elie Bakouch, John David Pressman, Honglu Fan, Dashiell Stander, Guangyu Song, Aaron Gokaslan, Tom Goldstein, Brian R. Bartoldson, Bhavya Kailkhura, Tyler Murray
Title: The Common Pile...

Jun 07, 2025 | 18 min | Ep. 879

VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos

🤗 Upvotes: 21 | cs.CV
Authors: Hanoona Rasheed, Abdelrahman Shaker, Anqi Tang, Muhammad Maaz, Ming-Hsuan Yang, Salman Khan, Fahad Khan
Title: VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos
Arxiv: http://arxiv.org/abs/2506.05349v1
Abstract: Mathematical reasoning in real-world video settings presents a fundamentally different challenge than in static images or text. It requires interpreting fine-grained visual information, accurately reading handwritten o...

Jun 07, 2025 | 21 min | Ep. 878

MiMo-VL Technical Report

🤗 Upvotes: 58 | cs.CL
Authors: Xiaomi LLM-Core Team: Zihao Yue, Zhenru Lin, Yifan Song, Weikun Wang, Shuhuai Ren, Shuhao Gu, Shicheng Li, Peidian Li, Liang Zhao, Lei Li, Kainan Bao, Hao Tian, Hailin Zhang, Gang Wang, Dawei Zhu, Cici, Chenhong He, Bowen Ye, Bowen Shen, Zihan Zhang, Zihan Jiang, Zhixian Zheng, Zhichao Song, Zhenbo Luo, Yue Yu, Yudong Wang, Yuanyuan Tian, Yu Tu, Yihan Yan, Yi Huang, Xu Wang, Xinzhe Xu, Xingchen Song, Xing Zhang, Xing Yong, Xin Zhang, Xiangwei Deng, Wenyu Yang, ...

Jun 06, 2025 | 19 min | Ep. 877

Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning

🤗 Upvotes: 41 | cs.LG, cs.AI, cs.CL, cs.CV
Authors: Shuang Chen, Yue Guo, Zhaochen Su, Yafu Li, Yulun Wu, Jiacheng Chen, Jiayu Chen, Weijie Wang, Xiaoye Qu, Yu Cheng
Title: Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning
Arxiv: http://arxiv.org/abs/2506.04207v1
Abstract: Inspired by the remarkable reasoning capabilities of Deepseek-R1 in complex textual tasks, many works attempt to incentivize similar capabilities in Multimodal Large Language Models (M...

Jun 06, 2025 | 20 min | Ep. 876

AmbiK: Dataset of Ambiguous Tasks in Kitchen Environment

🤗 Upvotes: 39 | cs.LG, cs.AI, cs.CL, cs.RO
Authors: Anastasiia Ivanova, Eva Bakaeva, Zoya Volovikova, Alexey K. Kovalev, Aleksandr I. Panov
Title: AmbiK: Dataset of Ambiguous Tasks in Kitchen Environment
Arxiv: http://arxiv.org/abs/2506.04089v1
Abstract: As a part of an embodied agent, Large Language Models (LLMs) are typically used for behavior planning given natural language instructions from the user. However, dealing with ambiguous instructions in real-world environments remains a challenge...

Jun 06, 2025 | 21 min | Ep. 875

CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark

🤗 Upvotes: 35 | cs.AR, cs.AI, cs.CL, cs.LG, cs.PL
Authors: Ahmed Heakl, Sarim Hashmi, Gustavo Bertolo Stahl, Seung Hun Eddie Han, Salman Khan, Abdulrahman Mahmoud
Title: CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark
Arxiv: http://arxiv.org/abs/2505.16968v3
Abstract: We introduce CASS, the first large-scale dataset and model suite for cross-architecture GPU code transpilation, targeting both source-level (CUDA <--> HIP) and assembly-level (Nvidia SASS <--> AMD RD...

Jun 06, 2025 | 23 min | Ep. 874

A Controllable Examination for Long-Context Language Models

🤗 Upvotes: 30 | cs.CL
Authors: Yijun Yang, Zeyu Huang, Wenhao Zhu, Zihan Qiu, Fei Yuan, Jeff Z. Pan, Ivan Titov
Title: A Controllable Examination for Long-Context Language Models
Arxiv: http://arxiv.org/abs/2506.02921v1
Abstract: Existing frameworks for evaluating long-context language models (LCLM) can be broadly categorized into real-world and synthetic tasks. Despite their utility, both approaches are accompanied by certain intrinsic limitations. Real-world tasks are too complex to interpret...

Jun 06, 2025 | 22 min | Ep. 873

MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in Videos

🤗 Upvotes: 25 | cs.CV, cs.CL
Authors: Kejian Zhu, Zhuoran Jin, Hongbang Yuan, Jiachun Li, Shangqing Tu, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao
Title: MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in Videos
Arxiv: http://arxiv.org/abs/2506.04141v1
Abstract: The sequential structure of videos poses a challenge to the ability of multimodal large language models (MLLMs) to locate multi-frame evidence and conduct multimodal reasoning. However, existing video benchmarks mai...

Jun 06, 2025 | 23 min | Ep. 872

Establishing Trustworthy LLM Evaluation via Shortcut Neuron Analysis

🤗 Upvotes: 23 | cs.CL
Authors: Kejian Zhu, Shangqing Tu, Zhuoran Jin, Lei Hou, Juanzi Li, Jun Zhao
Title: Establishing Trustworthy LLM Evaluation via Shortcut Neuron Analysis
Arxiv: http://arxiv.org/abs/2506.04142v1
Abstract: The development of large language models (LLMs) depends on trustworthy evaluation. However, most current evaluations rely on public benchmarks, which are prone to data contamination issues that significantly compromise fairness. Previous researches have focused on construc...

Jun 06, 2025 | 20 min | Ep. 871

SuperWriter: Reflection-Driven Long-Form Generation with Large Language Models

🤗 Upvotes: 23 | cs.CL
Authors: Yuhao Wu, Yushi Bai, Zhiqiang Hu, Juanzi Li, Roy Ka-Wei Lee
Title: SuperWriter: Reflection-Driven Long-Form Generation with Large Language Models
Arxiv: http://arxiv.org/abs/2506.04180v1
Abstract: Long-form text generation remains a significant challenge for large language models (LLMs), particularly in maintaining coherence, ensuring logical consistency, and preserving text quality as sequence length increases. To address these limitations, we propose SuperWriter...

Jun 06, 2025 | 22 min | Ep. 870

Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning

🤗 Upvotes: 144 | cs.CL
Authors: Shelly Bensal, Umar Jamil, Christopher Bryant, Melisa Russak, Kiran Kamble, Dmytro Mozolevskyi, Muayad Ali, Waseem AlShikh
Title: Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning
Arxiv: http://arxiv.org/abs/2505.24726v1
Abstract: We explore a method for improving the performance of large language models through self-reflection and reinforcement learning. By incentivizing the model to generate better self-reflections when it answers incorrect...

Jun 05, 2025 | 23 min | Ep. 869

VS-Bench: Evaluating VLMs for Strategic Reasoning and Decision-Making in Multi-Agent Environments

🤗 Upvotes: 51 | cs.AI
Authors: Zelai Xu, Zhexuan Xu, Xiangmin Yi, Huining Yuan, Xinlei Chen, Yi Wu, Chao Yu, Yu Wang
Title: VS-Bench: Evaluating VLMs for Strategic Reasoning and Decision-Making in Multi-Agent Environments
Arxiv: http://arxiv.org/abs/2506.02387v1
Abstract: Recent advancements in Vision Language Models (VLMs) have expanded their capabilities to interactive agent tasks, yet existing benchmarks remain limited to single-agent or text-only environments. In contrast, real-world scenar...

Jun 05, 2025 | 24 min | Ep. 868

UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

🤗 Upvotes: 49 | cs.CV, cs.AI, cs.CL
Authors: Bin Lin, Zongjian Li, Xinhua Cheng, Yuwei Niu, Yang Ye, Xianyi He, Shenghai Yuan, Wangbo Yu, Shaodong Wang, Yunyang Ge, Yatian Pang, Li Yuan
Title: UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
Arxiv: http://arxiv.org/abs/2506.03147v2
Abstract: Although existing unified models achieve strong performance in vision-language understanding and text-to-image generation, they remain limited in addressing image ...

Jun 05, 2025 | 19 min | Ep. 867

SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis

🤗 Upvotes: 46 | cs.LG, cs.CL, cs.CV
Authors: Zijian Wu, Jinjie Ni, Xiangyan Liu, Zichen Liu, Hang Yan, Michael Qizhe Shieh
Title: SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis
Arxiv: http://arxiv.org/abs/2506.02096v1
Abstract: Vision-language models (VLMs) trained via reinforcement learning with verifiable reward (RLVR) have shown notable progress in scaling test-time compute effectively. In this work, we investigate how synthesized RL data can further improve RLVR. To this e...

Jun 05, 2025 | 18 min | Ep. 866

CSVQA: A Chinese Multimodal Benchmark for Evaluating STEM Reasoning Capabilities of VLMs

🤗 Upvotes: 43 | cs.CV, cs.AI
Authors: Ai Jian, Weijie Qiu, Xiaokun Wang, Peiyu Wang, Yunzhuo Hao, Jiangbo Pei, Yichen Wei, Yi Peng, Xuchen Song
Title: CSVQA: A Chinese Multimodal Benchmark for Evaluating STEM Reasoning Capabilities of VLMs
Arxiv: http://arxiv.org/abs/2505.24120v1
Abstract: Vision-Language Models (VLMs) have demonstrated remarkable progress in multimodal understanding, yet their capabilities for scientific reasoning remains inadequately assessed. Current multimodal benchmarks pr...

Jun 05, 2025 | 21 min | Ep. 865

GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents

🤗 Upvotes: 29 | cs.CL, cs.AI, cs.CV
Authors: Qianhui Wu, Kanzhi Cheng, Rui Yang, Chaoyun Zhang, Jianwei Yang, Huiqiang Jiang, Jian Mu, Baolin Peng, Bo Qiao, Reuben Tan, Si Qin, Lars Liden, Qingwei Lin, Huan Zhang, Tong Zhang, Jianbing Zhang, Dongmei Zhang, Jianfeng Gao
Title: GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents
Arxiv: http://arxiv.org/abs/2506.03143v1
Abstract: One of the principal challenges in building VLM-powered GUI agents is visual grounding, i.e., localizing the app...

Jun 05, 2025 | 22 min | Ep. 864

Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces

🤗 Upvotes: 29 | cs.CV, cs.RO
Authors: Gen Luo, Ganlin Yang, Ziyang Gong, Guanzhou Chen, Haonan Duan, Erfei Cui, Ronglei Tong, Zhi Hou, Tianyi Zhang, Zhe Chen, Shenglong Ye, Lewei Lu, Jingbo Wang, Wenhai Wang, Jifeng Dai, Yu Qiao, Rongrong Ji, Xizhou Zhu
Title: Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces
Arxiv: http://arxiv.org/abs/2506.00123v1
Abstract: The remarkable progress of Multimodal Large Language Models (MLLMs) has attracted increasing ...

Jun 05, 2025 | 22 min | Ep. 863

OThink-R1: Intrinsic Fast/Slow Thinking Mode Switching for Over-Reasoning Mitigation

🤗 Upvotes: 28 | cs.AI
Authors: Shengjia Zhang, Junjie Wu, Jiawei Chen, Changwang Zhang, Xingyu Lou, Wangchunshu Zhou, Sheng Zhou, Can Wang, Jun Wang
Title: OThink-R1: Intrinsic Fast/Slow Thinking Mode Switching for Over-Reasoning Mitigation
Arxiv: http://arxiv.org/abs/2506.02397v1
Abstract: Recent advanced large reasoning models (LRMs) leverage extended chain-of-thought (CoT) reasoning to solve complex tasks, achieving state-of-the-art performance. Despite their success, we identify a critical ...

Jun 05, 2025 | 24 min | Ep. 862
Hosted on Transistor