Daily Paper Cast

Jingwen Liang, Gengyu Wang•dailypapercast.transistor.fm

We publish 10 episodes every day, each discussing one AI research paper. Both the podcast scripts and the audio are generated by AI. The 10 papers are selected from the highest-voted ones on Hugging Face Daily Papers (https://huggingface.co/papers).

Feedback and suggestions are welcome! Email us: dailypapercast.ai@gmail.com

Creators:
Jingwen Liang, 3D ML, https://www.linkedin.com/in/jingwen-liang/
Gengyu Wang, NLP, http://wanggengyu.com

Listen on:
Spotify: https://open.spotify.com/show/21nrhmdaA8qoBiH8q03NXL
Apple Podcast: https://podcasts.apple.com/us/podcast/daily-paper-cast/id1777620236

Cover Image by Kawen Kuang https://kawen.art

Last refreshed: July 27th, 2025 at 9:35 AM

Episodes

Qwen2.5-1M Technical Report

🤗 Upvotes: 26 | cs.CL Authors: An Yang, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoyan Huang, Jiandong Jiang, Jianhong Tu, Jianwei Zhang, Jingren Zhou, Junyang Lin, Kai Dang, Kexin Yang, Le Yu, Mei Li, Minmin Sun, Qin Zhu, Rui Men, Tao He, Weijia Xu, Wenbiao Yin, Wenyuan Yu, Xiafei Qiu, Xingzhang Ren, Xinlong Yang, Yong Li, Zhiying Xu, Zipeng Zhang Title: Qwen2.5-1M Technical Report Arxiv: http://arxiv.org/abs/2501.15383v1 Abstract: We introduce Qwen2.5-1M, a series of models that exte...

Jan 29, 2025•24 min•Ep. 441

ARWKV: Pretrain is not what we need, an RNN-Attention-Based Language Model Born from Transformer

🤗 Upvotes: 13 | cs.CL Authors: Lin Yueyu, Li Zhiyuan, Peter Yue, Liu Xiao Title: ARWKV: Pretrain is not what we need, an RNN-Attention-Based Language Model Born from Transformer Arxiv: http://arxiv.org/abs/2501.15570v1 Abstract: As is known, hybrid quadratic and subquadratic attention models in multi-head architectures have surpassed both Transformer and Linear RNN models, with these works primarily focusing on reducing KV complexity and improving efficiency. For further research on expressive...

Jan 29, 2025•21 min•Ep. 440

Towards General-Purpose Model-Free Reinforcement Learning

🤗 Upvotes: 13 | cs.LG, cs.AI Authors: Scott Fujimoto, Pierluca D'Oro, Amy Zhang, Yuandong Tian, Michael Rabbat Title: Towards General-Purpose Model-Free Reinforcement Learning Arxiv: http://arxiv.org/abs/2501.16142v1 Abstract: Reinforcement learning (RL) promises a framework for near-universal problem-solving. In practice however, RL algorithms are often tailored to specific benchmarks, relying on carefully tuned hyperparameters and algorithmic choices. Recently, powerful model-based RL methods...

Jan 29, 2025•21 min•Ep. 439

Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation

🤗 Upvotes: 11 | cs.SD, cs.CL, eess.AS Authors: Haorui He, Zengqiang Shang, Chaoren Wang, Xuyuan Li, Yicheng Gu, Hua Hua, Liwei Liu, Chen Yang, Jiaqi Li, Peiyang Shi, Yuancheng Wang, Kai Chen, Pengyuan Zhang, Zhizheng Wu Title: Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation Arxiv: http://arxiv.org/abs/2501.15907v1 Abstract: Recent advancements in speech generation have been driven by the large-scale training datasets. However, current models fall short ...

Jan 29, 2025•22 min•Ep. 438

iFormer: Integrating ConvNet and Transformer for Mobile Application

🤗 Upvotes: 9 | cs.CV, cs.AI Authors: Chuanyang Zheng Title: iFormer: Integrating ConvNet and Transformer for Mobile Application Arxiv: http://arxiv.org/abs/2501.15369v1 Abstract: We present a new family of mobile hybrid vision networks, called iFormer, with a focus on optimizing latency and accuracy on mobile applications. iFormer effectively integrates the fast local representation capacity of convolution with the efficient global modeling ability of self-attention. The local interactions are ...

Jan 29, 2025•24 min•Ep. 437

Are Vision Language Models Texture or Shape Biased and Can We Steer Them?

🤗 Upvotes: 7 | cs.CV, cs.AI, cs.LG, q-bio.NC Authors: Paul Gavrikov, Jovita Lukasik, Steffen Jung, Robert Geirhos, Bianca Lamm, Muhammad Jehanzeb Mirza, Margret Keuper, Janis Keuper Title: Are Vision Language Models Texture or Shape Biased and Can We Steer Them? Arxiv: http://arxiv.org/abs/2403.09193v1 Abstract: Vision language models (VLMs) have drastically changed the computer vision model landscape in only a few years, opening an exciting array of new applications from zero-shot image classi...

Jan 29, 2025•25 min•Ep. 436

CodeMonkeys: Scaling Test-Time Compute for Software Engineering

🤗 Upvotes: 5 | cs.LG Authors: Ryan Ehrlich, Bradley Brown, Jordan Juravsky, Ronald Clark, Christopher Ré, Azalia Mirhoseini Title: CodeMonkeys: Scaling Test-Time Compute for Software Engineering Arxiv: http://arxiv.org/abs/2501.14723v1 Abstract: Scaling test-time compute is a promising axis for improving LLM capabilities. However, test-time compute can be scaled in a variety of ways, and effectively combining different approaches remains an active area of research. Here, we explore this problem...

Jan 29, 2025•23 min•Ep. 435

Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models

🤗 Upvotes: 4 | cs.LG, cs.AI Authors: Samira Abnar, Harshay Shah, Dan Busbridge, Alaaeldin Mohamed Elnouby Ali, Josh Susskind, Vimal Thilak Title: Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models Arxiv: http://arxiv.org/abs/2501.12370v2 Abstract: Scaling the capacity of language models has consistently proven to be a reliable approach for improving performance and unlocking new capabilities. Capacity can be primarily defined by two dimensions: the num...

Jan 29, 2025•21 min•Ep. 434

Humanity's Last Exam

🤗 Upvotes: 33 | cs.LG, cs.AI, cs.CL Authors: Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, Adam Khoja, Ryan Kim, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, Daron Anderson, Tung Nguyen, Mobeen Mahmood, Fiona Feng, Steven Y. Feng, Haoran Zhao, Michael Yu, Varun Gangal, Chelsea Zou, Zihan Wang, Jessica P. Wang, Pawan Kumar, Oleksandr Pokutnyi, Robert Gerbicz, Serguei Popov, John-Clark Levin, Mstyslav Kazakov, Joh...

Jan 28, 2025•23 min•Ep. 433

Chain-of-Retrieval Augmented Generation

🤗 Upvotes: 26 | cs.IR, cs.CL Authors: Liang Wang, Haonan Chen, Nan Yang, Xiaolong Huang, Zhicheng Dou, Furu Wei Title: Chain-of-Retrieval Augmented Generation Arxiv: http://arxiv.org/abs/2501.14342v1 Abstract: This paper introduces an approach for training o1-like RAG models that retrieve and reason over relevant information step by step before generating the final answer. Conventional RAG methods usually perform a single retrieval step before the generation process, which limits their effectiv...

Jan 28, 2025•23 min•Ep. 432

Redundancy Principles for MLLMs Benchmarks

🤗 Upvotes: 22 | cs.CL, cs.AI Authors: Zicheng Zhang, Xiangyu Zhao, Xinyu Fang, Chunyi Li, Xiaohong Liu, Xiongkuo Min, Haodong Duan, Kai Chen, Guangtao Zhai Title: Redundancy Principles for MLLMs Benchmarks Arxiv: http://arxiv.org/abs/2501.13953v1 Abstract: With the rapid iteration of Multi-modality Large Language Models (MLLMs) and the evolving demands of the field, the number of benchmarks produced annually has surged into the hundreds. The rapid growth has inevitably led to significant redund...

Jan 28, 2025•22 min•Ep. 431

RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques

🤗 Upvotes: 13 | cs.CL, cs.AI, cs.LG Authors: Zhengyang Tang, Ziniu Li, Zhenyang Xiao, Tian Ding, Ruoyu Sun, Benyou Wang, Dayiheng Liu, Fei Huang, Tianyu Liu, Bowen Yu, Junyang Lin Title: RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques Arxiv: http://arxiv.org/abs/2501.14492v1 Abstract: Critiques are important for enhancing the performance of Large Language Models (LLMs), enabling both self-improvement and constructive feedback for others by identifying flaws and s...

Jan 28, 2025•24 min•Ep. 430

RL + Transformer = A General-Purpose Problem Solver

🤗 Upvotes: 7 | cs.LG, cs.AI Authors: Micah Rentschler, Jesse Roberts Title: RL + Transformer = A General-Purpose Problem Solver Arxiv: http://arxiv.org/abs/2501.14176v1 Abstract: What if artificial intelligence could not only solve problems for which it was trained but also learn to teach itself to solve new problems (i.e., meta-learn)? In this study, we demonstrate that a pre-trained transformer fine-tuned with reinforcement learning over multiple episodes develops the ability to solve problem...

Jan 28, 2025•24 min•Ep. 429

Relightable Full-Body Gaussian Codec Avatars

🤗 Upvotes: 5 | cs.CV, cs.GR Authors: Shaofei Wang, Tomas Simon, Igor Santesteban, Timur Bagautdinov, Junxuan Li, Vasu Agrawal, Fabian Prada, Shoou-I Yu, Pace Nalbone, Matt Gramlich, Roman Lubachersky, Chenglei Wu, Javier Romero, Jason Saragih, Michael Zollhoefer, Andreas Geiger, Siyu Tang, Shunsuke Saito Title: Relightable Full-Body Gaussian Codec Avatars Arxiv: http://arxiv.org/abs/2501.14726v1 Abstract: We propose Relightable Full-Body Gaussian Codec Avatars, a new approach for modeling relig...

Jan 28, 2025•21 min•Ep. 428

Question Answering on Patient Medical Records with Private Fine-Tuned LLMs

🤗 Upvotes: 4 | cs.CL, cs.AI Authors: Sara Kothari, Ayush Gupta Title: Question Answering on Patient Medical Records with Private Fine-Tuned LLMs Arxiv: http://arxiv.org/abs/2501.13687v1 Abstract: Healthcare systems continuously generate vast amounts of electronic health records (EHRs), commonly stored in the Fast Healthcare Interoperability Resources (FHIR) standard. Despite the wealth of information in these records, their complexity and volume make it difficult for users to retrieve and inter...

Jan 28, 2025•22 min•Ep. 427

GeoPixel: Pixel Grounding Large Multimodal Model in Remote Sensing

🤗 Upvotes: 3 | cs.CV Authors: Akashah Shabbir, Mohammed Zumri, Mohammed Bennamoun, Fahad S. Khan, Salman Khan Title: GeoPixel: Pixel Grounding Large Multimodal Model in Remote Sensing Arxiv: http://arxiv.org/abs/2501.13925v1 Abstract: Recent advances in large multimodal models (LMMs) have recognized fine-grained grounding as an imperative factor of visual understanding and dialogue. However, the benefits of such representation in LMMs are limited to the natural image domain, and these models pe...

Jan 28, 2025•23 min•Ep. 426

AdaIR: Adaptive All-in-One Image Restoration via Frequency Mining and Modulation

🤗 Upvotes: 2 | cs.CV Authors: Yuning Cui, Syed Waqas Zamir, Salman Khan, Alois Knoll, Mubarak Shah, Fahad Shahbaz Khan Title: AdaIR: Adaptive All-in-One Image Restoration via Frequency Mining and Modulation Arxiv: http://arxiv.org/abs/2403.14614v1 Abstract: In the image acquisition process, various forms of degradation, including noise, haze, and rain, are frequently introduced. These degradations typically arise from the inherent limitations of cameras or unfavorable ambient conditions. To rec...

Jan 28, 2025•21 min•Ep. 425

Multiview Equivariance Improves 3D Correspondence Understanding with Minimal Feature Finetuning

🤗 Upvotes: 2 | cs.CV Authors: Yang You, Yixin Li, Congyue Deng, Yue Wang, Leonidas Guibas Title: Multiview Equivariance Improves 3D Correspondence Understanding with Minimal Feature Finetuning Arxiv: http://arxiv.org/abs/2411.19458v1 Abstract: Vision foundation models, particularly the ViT family, have revolutionized image understanding by providing rich semantic features. However, despite their success in 2D comprehension, their abilities on grasping 3D spatial relationships are still unclear....

Jan 28, 2025•24 min•Ep. 424

SRMT: Shared Memory for Multi-agent Lifelong Pathfinding

🤗 Upvotes: 46 | cs.LG, cs.AI, cs.MA, I.2.11 Authors: Alsu Sagirova, Yuri Kuratov, Mikhail Burtsev Title: SRMT: Shared Memory for Multi-agent Lifelong Pathfinding Arxiv: http://arxiv.org/abs/2501.13200v1 Abstract: Multi-agent reinforcement learning (MARL) demonstrates significant progress in solving cooperative and competitive multi-agent problems in various environments. One of the principal challenges in MARL is the need for explicit prediction of the agents' behavior to achieve cooperation. T...

Jan 25, 2025•24 min•Ep. 423

Sigma: Differential Rescaling of Query, Key and Value for Efficient Language Models

🤗 Upvotes: 33 | cs.CL Authors: Zhenghao Lin, Zihao Tang, Xiao Liu, Yeyun Gong, Yi Cheng, Qi Chen, Hang Li, Ying Xin, Ziyue Yang, Kailai Yang, Yu Yan, Xiao Liang, Shuai Lu, Yiming Huang, Zheheng Luo, Lei Qu, Xuan Feng, Yaoxiang Wang, Yuqing Xia, Feiyang Chen, Yuting Jiang, Yasen Hu, Hao Ni, Binyang Li, Guoshuai Zhao, Jui-Hao Chiang, Zhongxin Guo, Chen Lin, Kun Kuang, Wenjie Li, Yelong Shen, Jian Jiao, Peng Cheng, Mao Yang Title: Sigma: Differential Rescaling of Query, Key and Value for Efficient...

Jan 25, 2025•21 min•Ep. 422

Improving Video Generation with Human Feedback

🤗 Upvotes: 30 | cs.CV, cs.AI, cs.GR, cs.LG Authors: Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Wenyu Qin, Menghan Xia, Xintao Wang, Xiaohong Liu, Fei Yang, Pengfei Wan, Di Zhang, Kun Gai, Yujiu Yang, Wanli Ouyang Title: Improving Video Generation with Human Feedback Arxiv: http://arxiv.org/abs/2501.13918v1 Abstract: Video generation has achieved significant advances through rectified flow techniques, but issues like unsmooth motion and misa...

Jan 25, 2025•24 min•Ep. 421

Temporal Preference Optimization for Long-Form Video Understanding

🤗 Upvotes: 15 | cs.CV, cs.AI, cs.CL, cs.LG, cs.RO Authors: Rui Li, Xiaohan Wang, Yuhui Zhang, Zeyu Wang, Serena Yeung-Levy Title: Temporal Preference Optimization for Long-Form Video Understanding Arxiv: http://arxiv.org/abs/2501.13919v1 Abstract: Despite significant advancements in video large multimodal models (video-LMMs), achieving effective temporal grounding in long-form videos remains a challenge for existing models. To address this limitation, we propose Temporal Preference Optimization...

Jan 25, 2025•25 min•Ep. 420

Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step

🤗 Upvotes: 14 | cs.CV, cs.AI, cs.CL Authors: Ziyu Guo, Renrui Zhang, Chengzhuo Tong, Zhizheng Zhao, Peng Gao, Hongsheng Li, Pheng-Ann Heng Title: Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step Arxiv: http://arxiv.org/abs/2501.13926v1 Abstract: Chain-of-Thought (CoT) reasoning has been extensively explored in large models to tackle complex understanding tasks. However, it still remains an open question whether such strategies can be applied to verifying...

Jan 25, 2025•21 min•Ep. 419

Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

🤗 Upvotes: 10 | cs.CV, cs.CL Authors: Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, Ziwei Liu Title: Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos Arxiv: http://arxiv.org/abs/2501.13826v1 Abstract: Humans acquire knowledge through three cognitive stages: perceiving information, comprehending knowledge, and adapting knowledge to solve novel problems. Videos serve as an effective medium for this learning process, facilitating...

Jan 25, 2025•21 min•Ep. 418

DiffuEraser: A Diffusion Model for Video Inpainting

🤗 Upvotes: 8 | cs.CV Authors: Xiaowen Li, Haolan Xue, Peiran Ren, Liefeng Bo Title: DiffuEraser: A Diffusion Model for Video Inpainting Arxiv: http://arxiv.org/abs/2501.10018v1 Abstract: Recent video inpainting algorithms integrate flow-based pixel propagation with transformer-based generation to leverage optical flow for restoring textures and objects using information from neighboring frames, while completing masked regions through visual Transformers. However, these approaches often encounte...

Jan 25, 2025•22 min•Ep. 417

IMAGINE-E: Image Generation Intelligence Evaluation of State-of-the-art Text-to-Image Models

🤗 Upvotes: 8 | cs.CV, cs.CL, cs.LG Authors: Jiayi Lei, Renrui Zhang, Xiangfei Hu, Weifeng Lin, Zhen Li, Wenjian Sun, Ruoyi Du, Le Zhuo, Zhongyu Li, Xinyue Li, Shitian Zhao, Ziyu Guo, Yiting Lu, Peng Gao, Hongsheng Li Title: IMAGINE-E: Image Generation Intelligence Evaluation of State-of-the-art Text-to-Image Models Arxiv: http://arxiv.org/abs/2501.13920v1 Abstract: With the rapid development of diffusion models, text-to-image (T2I) models have made significant progress, showcasing impressive abi...

Jan 25, 2025•29 min•Ep. 416

Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback

🤗 Upvotes: 7 | cs.LG, cs.AI Authors: Yen-Ting Lin, Di Jin, Tengyu Xu, Tianhao Wu, Sainbayar Sukhbaatar, Chen Zhu, Yun He, Yun-Nung Chen, Jason Weston, Yuandong Tian, Arash Rahnama, Sinong Wang, Hao Ma, Han Fang Title: Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback Arxiv: http://arxiv.org/abs/2501.10799v1 Abstract: Large language models (LLMs) have recently demonstrated remarkable success in mathematical reasoning. Despite progress in methods like chain-of-thought p...

Jan 25, 2025•21 min•Ep. 415

One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt

🤗 Upvotes: 5 | cs.CV, cs.AI, cs.LG Authors: Tao Liu, Kai Wang, Senmao Li, Joost van de Weijer, Fahad Shahbaz Khan, Shiqi Yang, Yaxing Wang, Jian Yang, Ming-Ming Cheng Title: One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt Arxiv: http://arxiv.org/abs/2501.13554v1 Abstract: Text-to-image generation models can create high-quality images from input prompts. However, they struggle to support the consistent generation of identity-preserving requirements for ...

Jan 25, 2025•22 min•Ep. 414

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

🤗 Upvotes: 109 | cs.CL, cs.AI, cs.LG Authors: DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting C...

Jan 24, 2025•21 min•Ep. 413

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

🤗 Upvotes: 44 | cs.CV Authors: Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, Peng Jin, Wenqi Zhang, Fan Wang, Lidong Bing, Deli Zhao Title: VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding Arxiv: http://arxiv.org/abs/2501.13106v2 Abstract: In this paper, we propose VideoLLaMA3, a more advanced multimodal foundation model for image and video understanding. The core design philoso...

Jan 24, 2025•23 min•Ep. 412

Hosted on Transistor
